Abstract We consider the Wright-Fisher model for a population of $N$ individuals, each identified with a sequence of a finite number of sites, and single-crossover recombination between them. We trace back the ancestry of single individuals from the present population. In the $N \to \infty$ limit without rescaling of parameters or time, this ancestral process is described by a random tree, whose branching events correspond to the splitting of the sequence due to recombination. With the help of a decomposition of the trees into subtrees, we calculate the probabilities of the topologies of the ancestral trees. At the same time, these probabilities lead to a semi-explicit solution of the deterministic single-crossover equation. The latter is a discrete-time dynamical system that emerges from the Wright-Fisher model via a law of large numbers and has long been waiting for a solution.

1 Introduction

Recombination happens during sexual reproduction and refers to the combination of the genetic material of two parents into the 'mixed' type of an offspring individual. More precisely, the recombined offspring results from a reciprocal exchange of maternal and paternal gene sequences via so-called crossovers. Due to the interaction of individuals and due to dependencies between the positions at which recombination may take place, the process is difficult to handle. This applies even to the limit of infinite population size, where a law of large numbers turns the dynamics of gene frequencies into a deterministic, nonlinear system of difference or differential equations, which has challenged population geneticists since its first formulation by Geiringer in 1944; see Bennett (1954), McHale and Ringwood (1983), Dawson (2000, 2002), and Baake and Baake (2003) for a sample of subsequent work. Under the so-called single-crossover assumption, where at most one crossover occurs in any gene sequence in every generation, the deterministic model can be solved explicitly (and in an astonishingly simple way) in continuous time (Baake and Baake 2003). But the corresponding discrete-time dynamics, which is prevalent in the biological literature, is more difficult; its solution has, so far, required nontrivial transformations and recursions that have not yet been solved in closed form (Bennett 1954; Dawson 2000, 2002; von Wangenheim et al. 2010).

In this paper, we will present a semi-explicit solution to the discrete-time single-crossover population
model by considering the ancestry of single individuals. The original deterministic forward-time dynamics 
is thus considered in terms of a stochastic process backward in time, whose solution leads to that of 
the original system. In the backward process, one gains a certain conditional independence of gene 
segments, which will allow for a solution. In this sense, a probabilistic representation provides the 
necessary understanding to solve the original deterministic problem. 

More precisely, we proceed as follows. In Section 2, we start from the stochastic (i.e., finite-population) version of the discrete-time single-crossover model, that is, the Wright-Fisher model with
single-crossover recombination (Hein et al. 2005, Chap. 5.4). In the limit of population size tending to
infinity (without rescaling of parameters or time), a law of large numbers (established here explicitly) 
leads to the corresponding deterministic dynamical system. We recall some general properties of this 
system and discuss the various dependencies (between individuals and between gene segments) that 
have, so far, obstructed an explicit solution. In Section 3, we take the backward point of view and
consider the ancestry of the genetic material of single individuals. In the limit of infinite population 
size, this ancestry is a random tree for any finite time horizon, that is, segments that have been 
separated once do not come together again in the same individual (with probability one). The law 
for this ancestral tree may be formulated explicitly in terms of a (stochastic) segmentation process, 
which involves conditional independence between segments once they appear. As a consequence, the 
time evolution of the ancestral process may be calculated via a decomposition into subtrees. This 
solution is semi-explicit in the sense that it is a sum of well-defined terms, where summation is over 
certain tree topologies, which must be enumerated in a recursive way. In the same sense, this yields 
a solution of the deterministic forward-in-time model. We will discuss our results in the context of 
related approaches in Section 4, in particular, the ancestral recombination graph (the usual approach 
to recombination in finite populations). 



2 The recombination model forward in time 

2.1 The model 



In this section, we describe the basic setting, the Wright-Fisher model with single-crossover recombination, as well as the dynamical system (from von Wangenheim et al. 2010) that arises as its infinite-population limit. A chromosome is described by a linear arrangement of, say, $n+1$ sites, namely, the elements of the set $S := \{0, 1, \ldots, n\}$. Sites represent discrete positions on a chromosome that may be interpreted as gene or nucleotide positions. Thus, each site $i \in S$ can be occupied by an allele (or letter) $x_i \in X_i$, where we restrict ourselves to finite $X_i$. A type $x$ is then defined as a sequence $x = (x_0, x_1, \ldots, x_n) \in X_0 \times X_1 \times \cdots \times X_n =: X$, where $X$ denotes the (finite) type space. Neighbouring sites are connected by links, the entities where recombination events may occur. They are collected into the set $L = \{\frac{1}{2}, \frac{3}{2}, \ldots, \frac{2n-1}{2}\}$, where link $\alpha = \frac{2i+1}{2}$ denotes the link between sites $i$ and $i+1$. We will only be concerned with single crossovers, i.e., the case where recombination occurs at a single link $\alpha \in L$ and results in a mixed type composed of the sites before $\alpha$ from the first parent, and those after $\alpha$ from the second parent. Explicitly, if recombination involves the ordered pair of types $x = (x_0, \ldots, x_n)$ and $y = (y_0, \ldots, y_n)$, the outcome of recombination at link $\frac{2i+1}{2}$ is the recombined type $(x_0, \ldots, x_i, y_{i+1}, \ldots, y_n)$. The dynamics of a finite population that evolves under single-crossover recombination can be described by the following version of the Wright-Fisher model (cf. Hein et al. 2005, Chap. 5.4):

Each link is equipped with a crossover probability $\varrho_\alpha > 0$ (with $\sum_{\alpha \in L} \varrho_\alpha \leq 1$). Each generation is of constant size $N$. In each generation, the current population is replaced by its offspring, where each offspring individual chooses its parent(s) independently according to the following scheme (see Figure 1):

— With probability $\varrho_\alpha > 0$, $\alpha \in L$, two parents are chosen uniformly with replacement. They recombine at link $\alpha$, which gives rise to the corresponding recombined offspring with the leading segment (the sites $0, \ldots, \lfloor\alpha\rfloor$) from the first and the trailing segment (the sites $\lceil\alpha\rceil, \ldots, n$) from the second parent, where $\lfloor\alpha\rfloor$ ($\lceil\alpha\rceil$) denotes the largest integer below (the smallest above) $\alpha$; if the same parent is chosen twice, it is effectively transmitted unchanged.

— With probability $0 \leq 1 - \sum_{\alpha \in L} \varrho_\alpha < 1$, a single parent is selected uniformly and with replacement from the previous generation.

We denote the population at time $t$ by

\[ Z_t = \bigl(Z_t(x)\bigr)_{x \in X} \in E := \{\nu \text{ counting measure on } X \mid \|\nu\| = N\}, \]

where $\|\cdot\|$ denotes total variation norm and $Z_t(x)$ is the number of individuals of type $x$ at time $t$. We will also need the corresponding normalised quantity $\hat{Z}_t := Z_t/N$, which is a probability vector (or measure) for the population at time $t$. In order to formalise the stochastic process, let $\pi_J : X \to X_J := \times_{i \in J} X_i$, $\pi_J(x) = (x_i)_{i \in J} =: x_J$, be the canonical projection to the sites in $J$ ($J \subseteq S$). We specifically need $\pi_{<\alpha} := \pi_{\{0, \ldots, \lfloor\alpha\rfloor\}}$ and $\pi_{>\alpha} := \pi_{\{\lceil\alpha\rceil, \ldots, n\}}$. For $p \in \mathcal{P}(X)$, the set of probability measures on $X$, we denote by $\pi_J.p := p \circ \pi_J^{-1}$ (where $\pi_J^{-1}$ denotes the preimage of $\pi_J$) the marginal distribution of $p$ with respect to the sites in $J$. Furthermore,

\[ R_\alpha(p) := (\pi_{<\alpha}.p) \otimes (\pi_{>\alpha}.p) \tag{1} \]

is the product measure of the two marginals (before and after $\alpha$); $R_\alpha$ is known as the recombination operator (or recombinator for short), cf. Baake and Baake (2003). It is clear that an individual that recombines at link $\alpha \in L$ in generation $t$ draws its type from $R_\alpha(\hat{Z}_{t-1})$, and a non-recombining individual draws its type from $\hat{Z}_{t-1} = R_\emptyset(\hat{Z}_{t-1})$, with $R_\emptyset := \mathbb{1}$ (the reason for this notation will become clear later).

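The discrete-time Markov chain $\{\hat{Z}_t\}_{t \in \mathbb{N}_0}$ on $\mathcal{P}(X)$ may therefore be formulated as follows:

— Let $N_\alpha(t)$, $\alpha \in L$, denote the random number of individuals generated in generation $t$ via recombination at link $\alpha \in L$. Analogously, $N_\emptyset(t)$ is the number of individuals that are sampled without recombining. Clearly, they follow a multinomial distribution:

\[ \bigl(N_\emptyset(t), N_{\frac{1}{2}}(t), \ldots, N_{\frac{2n-1}{2}}(t)\bigr) \sim \mathcal{M}\Bigl(N, \bigl(1 - \sum_{\alpha \in L} \varrho_\alpha, \varrho_{\frac{1}{2}}, \ldots, \varrho_{\frac{2n-1}{2}}\bigr)\Bigr), \quad \text{i.i.d. for all } t. \tag{2} \]

— According to the previous step, $Z_t$ consists of subpopulations $Y_\beta(t)$, $\beta \in L \cup \{\emptyset\}$, where $Y_\beta(t)$ consists of those individuals that, in generation $t$, experience recombination at $\beta$ (where $\beta = \emptyset$ indicates no recombination). Clearly,

\[ Y_\beta(t) \sim \mathcal{M}\bigl(N_\beta(t), R_\beta(\hat{Z}_{t-1})\bigr), \quad \beta \in L \cup \{\emptyset\}. \tag{3} \]

— Finally, we obtain $\hat{Z}_t$ via

\[ \hat{Z}_t = \frac{1}{N}\Bigl(Y_\emptyset(t) + \sum_{\alpha \in L} Y_\alpha(t)\Bigr). \tag{4} \]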

Obviously, the resampling-recombination mechanism is independent of the types. So, the Wright- 
Fisher model may, alternatively, be constructed as an independent superposition of the two processes, 
that is, 

(F1) It is first determined, for each time point and for each individual, which of the sites come from which parental individual (resampling/recombination without types).
(F2) Letters are then attached to the sites at time $t = 0$ and are then propagated through the model to time $t$ according to the relations decided in (F1).
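The sampling scheme of Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not code from the paper: the function names `recombine` and `next_generation` are hypothetical, types are stored as tuples, and the link $\frac{2i+1}{2}$ is represented by the integer index `i`.

```python
import random


def recombine(x, y, i):
    """Single crossover at the link between sites i and i+1:
    sites 0..i are taken from x, sites i+1..n from y."""
    return x[:i + 1] + y[i + 1:]


def next_generation(population, rho, rng):
    """One Wright-Fisher generation with single-crossover recombination.

    population: list of types (tuples of length n+1)
    rho:        crossover probabilities; rho[i] belongs to link (2i+1)/2
                (integer indexing of links is an assumption of this sketch)
    rng:        a random.Random instance
    """
    n_links = len(rho)
    # The extra index n_links stands for "no recombination".
    weights = list(rho) + [max(0.0, 1.0 - sum(rho))]
    offspring = []
    for _ in range(len(population)):
        event = rng.choices(range(n_links + 1), weights=weights)[0]
        first = rng.choice(population)        # uniform, with replacement
        if event == n_links:
            offspring.append(first)           # unaltered copy of one parent
        else:
            second = rng.choice(population)   # may coincide with `first`
            offspring.append(recombine(first, second, event))
    return offspring
```

For example, `next_generation(pop, [0.1, 0.2], random.Random(0))` returns a new list of the same length as `pop`; each offspring decides independently whether and where it recombines, as in the scheme above.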



2.2 Law of large numbers. 

Let us first consider the Wright-Fisher model in the so-called infinite population limit (IPL), where we let $N \to \infty$ without rescaling any other parameters. This may be considered as a limit of strong recombination, in which the stochastic effects of resampling (also known as genetic drift) are lost. This is in contrast to the more frequently used weak recombination limit, which leads to a diffusion process, compare Ewens (2004, Chap. 6.6) and Section 4 below.

More precisely, we consider the family of processes $\{\hat{Z}_t^{(N)}\}_{t \in \mathbb{N}_0}$, $N \in \mathbb{N}$ (where we temporarily add an upper index $N$ to denote population size) and compare it with the deterministic recombination dynamics, where we identify the population at time $t \in \mathbb{N}_0$ with $p_t = (p_t(x))_{x \in X} \in \mathcal{P}(X)$. Here $p_t(x)$ denotes the relative frequency of type $x \in X$ at time $t$, and $p_0$ is the initial population. The population is described by the dynamical system

\[ p_t = \Phi(p_{t-1}), \quad \text{where } \Phi(p) := \Bigl(1 - \sum_{\alpha \in L} \varrho_\alpha\Bigr) p + \sum_{\alpha \in L} \varrho_\alpha R_\alpha(p), \tag{5} \]

which is usually obtained by direct deterministic modelling (von Wangenheim et al. 2010). The following result shows that, indeed, (5) describes the infinite population limit of the stochastic process, more precisely, of the family of processes $\{\hat{Z}_t^{(N)}\}_{t \in \mathbb{N}_0}$, $N \in \mathbb{N}$. We will use $\Phi^{t+1} = \Phi \circ \Phi^t$ for the composition of the nonlinear mapping $\Phi$ with itself.

Proposition 1 (Infinite Population Limit) Let $\{\hat{Z}_t^{(N)}\}_{t \in \mathbb{N}_0}$ with $N \in \mathbb{N}$ be a family of Wright-Fisher models with single-crossover recombination (as defined by (2)–(4)) with initial states such that $\lim_{N \to \infty} \hat{Z}_0^{(N)} = p_0$. Then, for every given $t \in \mathbb{N}_0$, one has

\[ \lim_{N \to \infty} \hat{Z}_t^{(N)} = p_t \quad \text{in mean square,} \tag{6} \]

where $p_t = \Phi^t(p_0)$ denotes the solution of (5).
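To make the deterministic limit concrete, the map $\Phi$ of (5) can be iterated numerically on a small type space. The following is a minimal sketch under stated assumptions: a probability vector is stored as a dictionary from types (tuples) to weights, the link $\frac{2i+1}{2}$ is represented by the integer `i`, and the names `recombinator`, `phi` and `iterate` are hypothetical.

```python
def recombinator(p, i):
    """R_alpha for the link between sites i and i+1:
    the product measure of the two marginals of p."""
    left, right = {}, {}
    for x, w in p.items():
        left[x[:i + 1]] = left.get(x[:i + 1], 0.0) + w
        right[x[i + 1:]] = right.get(x[i + 1:], 0.0) + w
    return {a + b: wa * wb
            for a, wa in left.items()
            for b, wb in right.items()}


def phi(p, rho):
    """One step of the deterministic dynamics (5)."""
    out = {}
    stay = 1.0 - sum(rho)                 # probability of no recombination
    for x, w in p.items():
        out[x] = out.get(x, 0.0) + stay * w
    for i, r in enumerate(rho):
        for x, w in recombinator(p, i).items():
            out[x] = out.get(x, 0.0) + r * w
    return out


def iterate(p0, rho, t):
    """p_t = Phi^t(p_0)."""
    p = dict(p0)
    for _ in range(t):
        p = phi(p, rho)
    return p
```

For two sites and `p0 = {(0, 0): 0.5, (1, 1): 0.5}` with a single crossover probability 0.2, one step moves weight 0.05 to each of the mixed types `(0, 1)` and `(1, 0)`, since $R_{1/2}(p_0)$ is uniform on the four types.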
The corresponding situation in continuous time and with almost sure convergence (see Remark 1) is covered by the general law of large numbers of Ethier and Kurtz (1986, Theorem 11.2.1), but no such general result seems to be available in discrete time. We therefore include a proof.

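Proof (of Prop. 1) We employ induction over $t$. By assumption, the claim holds for $t = 0$. Now assume that it holds for $t - 1$, for some $t \geq 1$. We then have

\[ \frac{N_\alpha^{(N)}(t)}{N} \xrightarrow{N \to \infty} \varrho_\alpha, \ \alpha \in L, \quad \text{and} \quad \frac{N_\emptyset^{(N)}(t)}{N} \xrightarrow{N \to \infty} 1 - \sum_{\alpha \in L} \varrho_\alpha \tag{7} \]

by the mean square law of large numbers (cf. Grimmett and Stirzaker 2001, Chap. 7.4). Furthermore, for $\beta \in L$, $N_\beta^{(N)}(t) \to \infty$ as $N \to \infty$ with probability one (because $\varrho_\beta > 0$) and thus

\[ \frac{Y_\beta^{(N)}(t)}{N_\beta^{(N)}(t)} \xrightarrow{N \to \infty} R_\beta(p_{t-1}) \quad \text{in mean square,} \tag{8} \]

since $Y_\beta^{(N)}(t)/N_\beta^{(N)}(t) - R_\beta(\hat{Z}_{t-1}^{(N)}) \to 0$ due to the mean-square law of large numbers (except on the set of measure 0 where $N_\beta^{(N)}(t) \not\to \infty$, and thus altogether in mean square), and $R_\beta(\hat{Z}_{t-1}^{(N)}) \to R_\beta(p_{t-1})$ by the induction hypothesis. Analogously, for $1 - \sum_{\alpha \in L} \varrho_\alpha > 0$,

\[ \frac{Y_\emptyset^{(N)}(t)}{N_\emptyset^{(N)}(t)} \xrightarrow{N \to \infty} p_{t-1} \quad \text{in mean square.} \tag{9} \]

Since, by (4),

\[ \hat{Z}_t^{(N)} = \frac{N_\emptyset^{(N)}(t)}{N} \, \frac{Y_\emptyset^{(N)}(t)}{N_\emptyset^{(N)}(t)} + \sum_{\alpha \in L} \frac{N_\alpha^{(N)}(t)}{N} \, \frac{Y_\alpha^{(N)}(t)}{N_\alpha^{(N)}(t)}, \]

(7)–(9) together tell us that

\[ \hat{Z}_t^{(N)} \xrightarrow{N \to \infty} \Phi(p_{t-1}) \quad \text{in mean square,} \tag{10} \]

which proves the claim. □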



Remark 1 Note that Prop. 1 automatically implies that, for every given $t$,

\[ \lim_{N \to \infty} \max_{s \leq t} \bigl\| \hat{Z}_s^{(N)} - p_s \bigr\| = 0 \quad \text{in mean square,} \]

which is reminiscent of the continuous-time result (Ethier and Kurtz 1986, Theorem 11.2.1). Note, however, that the latter result is a strong law of large numbers; we have established the mean-square version here (which, of course, implies a weak law of large numbers since convergence in mean square implies convergence in probability) since the construction of a sequence of processes on the same probability space in the discrete-time setting is beyond the scope of this paper. Note also that the convergence in (6) applies for any fixed $t \in \mathbb{N}_0$, but need not hold as $t \to \infty$. Indeed, the asymptotic behaviour of the stochastic system is radically different from that of the deterministic one: Due to resampling, the Markov chain is absorbing (in fact, it experiences fixation of a single type with probability one in the long run). In contrast, the deterministic system never loses any type, and the complete product measure with respect to all links in $L$ is obtained as the stationary distribution, see Geiringer (1944) and von Wangenheim et al. (2010). Let us emphasise that it is the short time scale, not the long-term behaviour, that we are interested in here; see Section 3 for the discussion of the biological context.
2.3 Structure of the deterministic solution

Based upon an initial population $p_0$, every individual in the population at time $t = 1$ is either an unaltered copy of an individual from $p_0$ or it is composed of exactly two recombined segments, hence the population $p_1$ is a mixture of $p_0$ and the $R_\alpha(p_0)$, $\alpha \in L$, in line with (5). For $t > 1$, the population will contain individuals that consist of several segments pieced together from the sequences in the initial population due to various recombination events at different times. To describe these, we use the composite recombinators $R_G$, $G \subseteq L$, which act on probability vectors as

\[ R_G := \prod_{\alpha \in G} R_\alpha, \tag{11} \]

where we set $R_{\{\alpha\}} = R_\alpha$. Here, the product is to be read as composition. It is, indeed, a matrix product if the recombinators are written in their matrix representation, which is available in the case of finite types considered here, provided the problem is embedded into a larger space (Baake 2001). This definition is consistent since all $R_\alpha$ are idempotents and commute with each other, compare Baake and Baake (2003). Clearly, $R_G(p)$ is the product measure derived from $p$ with respect to all links in $G$. We thus expect the population at any time to be a convex combination of the $R_G(p_0)$ with $G \subseteq L$. This means

\[ p_t = \Phi^t(p_0) = \sum_{G \subseteq L} a_G(t) R_G(p_0) \tag{12} \]

with $a_G(t) \geq 0$ for all $G \subseteq L$, and $\sum_{G \subseteq L} a_G(t) = 1$. It has been proved by von Wangenheim et al. (2010) that the solution indeed has this form, but plausibility arguments go back to Geiringer (1944). The difficulty consists in determining the coefficient functions $a_G(t)$. Let us introduce the following abbreviations,

\[ G_{<\alpha} := \{\beta \in G \mid \beta < \alpha\}, \quad G_{>\alpha} := \{\beta \in G \mid \beta > \alpha\}, \]
\[ G_{\leq\alpha} := \{\beta \in G \mid \beta \leq \alpha\}, \quad G_{\geq\alpha} := \{\beta \in G \mid \beta \geq \alpha\}. \]

Let us recall the recursion for the coefficient functions from von Wangenheim et al. (2010):

Theorem 1 For all $G \subseteq L$ and $t \in \mathbb{N}_0$, the coefficient functions $a_G(t)$ evolve according to

\[ a_G(t+1) = \Bigl(1 - \sum_{\alpha \in L} \varrho_\alpha\Bigr) a_G(t) + \sum_{\alpha \in G} \varrho_\alpha \Bigl(\sum_{H \subseteq L_{\geq\alpha}} a_{G_{<\alpha} \cup H}(t)\Bigr) \Bigl(\sum_{K \subseteq L_{\leq\alpha}} a_{K \cup G_{>\alpha}}(t)\Bigr), \tag{13} \]

with initial condition $a_G(0) = \delta_{G,\emptyset}$.

A verbal description of this iteration can already be found in Geiringer (1944). It will become clear later that we may interpret $a_G(t)$ as the proportion of the population whose types have been pieced together by recombination at exactly the links of $G$.
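The recursion (13) is straightforward to iterate numerically, even though it resists a closed-form solution. The sketch below is an illustration under stated assumptions: links are represented by the integers $0, \ldots, n-1$ (link $\frac{2i+1}{2}$ becomes `i`), subsets $G$ are `frozenset`s, and the helper names `subsets`, `step` and `coefficients` are hypothetical.

```python
from itertools import combinations


def subsets(links):
    """All subsets of the given links, as frozensets."""
    links = list(links)
    for r in range(len(links) + 1):
        for c in combinations(links, r):
            yield frozenset(c)


def step(a, rho):
    """One application of recursion (13) to the coefficients a_G."""
    links = range(len(rho))
    stay = 1.0 - sum(rho)
    new = {}
    for G in subsets(links):
        total = stay * a[G]
        for alpha in G:
            g_lt = frozenset(b for b in G if b < alpha)
            g_gt = frozenset(b for b in G if b > alpha)
            # sum over H in L_{>=alpha} of a_{G_{<alpha} u H}
            left = sum(a[g_lt | H]
                       for H in subsets(b for b in links if b >= alpha))
            # sum over K in L_{<=alpha} of a_{K u G_{>alpha}}
            right = sum(a[K | g_gt]
                        for K in subsets(b for b in links if b <= alpha))
            total += rho[alpha] * left * right
        new[G] = total
    return new


def coefficients(rho, t):
    """a_G(t) for all G, starting from a_G(0) = delta_{G, empty set}."""
    a = {G: 0.0 for G in subsets(range(len(rho)))}
    a[frozenset()] = 1.0
    for _ in range(t):
        a = step(a, rho)
    return a
```

With a single link of crossover probability $r$, the recursion reduces to $a_{\{\alpha\}}(t+1) = (1-r)\,a_{\{\alpha\}}(t) + r$, so `coefficients([r], t)` gives $1 - (1-r)^t$ for the one-element set. The cost grows exponentially in the number of links, so this is only meant for small examples.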

Due to its nonlinearity, the recursion does not allow for an immediate solution (at least from four sites onwards). The nonlinearity comes from the dependence of links: Due to the single-crossover assumption, a crossover event forbids any other recombination events in the same time step. In sharp contrast, and quite surprisingly, the analogous (deterministic) single-crossover model in continuous time has a very simple explicit solution (Baake and Baake 2003; Baake 2005). The main reason for this is the fact that simultaneous crossover events are automatically excluded in continuous time. This implies an effective independence of links, which turns the dynamics corresponding to Theorem 1 into a linear one. For a detailed investigation of the differences between single-crossover dynamics in continuous and in discrete time, the reader is referred to von Wangenheim et al. (2010).

The conventional way (Bennett 1954; Dawson 2000, 2002) to overcome the obstacles of nonlinearity in recombination models lies in finding an appropriate transformation of the dynamics to a solvable diagonalised system, but this usually involves a new set of coefficients that must be constructed in a recursive manner. We have performed this for the single-crossover model (von Wangenheim et al. 2010), but the solution still requires recursions and does not lead to closed-form expressions for the $a_G(t)$. In contrast, we will pursue the stochastic perspective here and look at recombination backward in time, which will lead us to the coefficient functions in semi-explicit form.



3 Ancestral recombination process 

3.1 The ancestral process 

In the ancestral recombination process, we follow the ancestry of the genetic material of a selected individual from a population that evolved according to the Wright-Fisher model with single-crossover recombination of Section 2.1. To this end, we start with an individual in the present population at time $t$ and let time run backwards, as illustrated in Figure 2 for two individuals from the realisation of the Wright-Fisher model in Figure 1. Let us first describe the resulting partitioning of sites into parents, keeping in mind that this happens independently of the types, in analogy with step (F1) in the forward model.

We denote by $\tau$ the time backward from the present at time $t$, i.e., backward time $\tau$ corresponds to forward time $t - \tau$. (Note that, altogether, we use the symbol $t$ both for the variable of time and for the fixed number of generations for which the (forward-time) dynamics is considered. In the latter sense, $t$ stands for 'today'.) We capture the partitioning by a process $\{\Sigma_\tau\}_{\tau \in \mathbb{N}_0}$ on $\Pi(S)$, the set of partitions of $S$. Here, the parts of $\Sigma_\tau$ correspond to the parents at (backward) time $\tau$ of our individual at (forward) time $t$; sites in the same part correspond to sites that go back to the same parent. In view of the forward Wright-Fisher model, it is clear that $\{\Sigma_\tau\}_{\tau \in \mathbb{N}_0}$ is defined as follows.

Start with $\Sigma_0 = \{S\}$. Assume now that, for some $\tau \geq 0$, $\Sigma_\tau = \sigma := \{\sigma_1, \ldots, \sigma_k\}$, where $\sigma_j = \{\sigma_{j1}, \ldots, \sigma_{jn_j}\}$ and we imply $\sigma_{j1} < \sigma_{j2} < \cdots < \sigma_{jn_j}$, $1 \leq j \leq k$. Referring back to the Wright-Fisher model, $\Sigma_{\tau+1}$ is obtained in two steps:

(S) Splitting: Every part $\sigma_j$ of $\Sigma_\tau$, $1 \leq j \leq k$, independently of the others, either remains unchanged (probability $1 - \sum_{\sigma_{j1} < \alpha < \sigma_{jn_j}} \varrho_\alpha$), or, for every $\alpha \in L$ with $\sigma_{j1} < \alpha < \sigma_{jn_j}$, it may split into $\{i \in \sigma_j : i < \alpha\}$ and $\{i \in \sigma_j : i > \alpha\}$ (probability $\varrho_\alpha$). Note that two or more $\alpha$'s can lead to the same split if $\sigma_j$ is not contiguous, where 'contiguous' means an uninterrupted run of sites. The resulting refined partition is denoted by $\Sigma'_\tau$. This step corresponds to the splitting of the ancestral material into smaller segments due to recombination, where we do not yet decide which segment ends up in which parent.

(C) Coalescence: Each part of $\Sigma'_\tau$ now chooses one out of $N$ parents, uniformly and with replacement. Parts that end up in the same parent are united; otherwise, nothing happens. The resulting partition is $\Sigma_{\tau+1}$. Figure 2 illustrates this: If all parts are assigned to different parents, then no coalescence takes place, that is, $\Sigma_{\tau+1} = \Sigma'_\tau$ (as in Figure 2, left). If two or more parts go back to the same parent, we have a coalescence event, see Figure 2 (right).

A closely related process describing the ancestry of single individuals in continuous time and on a continuous chromosome was investigated by Wiuf and Hein (1997), but in the weak-recombination limit, and with a different purpose; we will come back to this in Section 3.

Our aim is now to determine the law for the ancestry and the type of a random individual at time $t$ without constructing a realisation of the forward Wright-Fisher model first. Such an individual, together with its ancestry, may be constructed in a three-step procedure, see Figure 3.

(A1) Run $\{\Sigma_\tau\}_{\tau \in \mathbb{N}_0}$ until $\tau = t$. $\Sigma_t$ tells us how the ancestral material of our individual is partitioned into parents at forward time 0 (in Figure 3, this is the top of the tree).

Fig. 2 Ancestries for two individuals from the Wright-Fisher population of Figure 1. We trace back the ancestry of the segments present at $t = 2$; the thin black lines indicate nonancestral material whose history is not relevant. The left graph refers to the second individual from Figure 1; here the two segments go back to two different ancestors. The right graph corresponds to the third individual from Figure 1. Two of its three segments have the same parent at $t = 0$ due to a coalescence event. In the infinite population limit, such a situation does not occur; rather, all possible ancestries are binary trees.



(A2) Assign a different colour to each part of $\Sigma_t$. The colours are for illustration; each colour corresponds to one individual from the initial population (at $t = 0$), chosen uniformly without replacement. Equivalently, one may sample a parent from the initial population with replacement for every part of $\Sigma'_{t-1}$. (The latter is more convenient and will be favoured in what follows.) In any case, every site receives a colour, which is propagated downwards. This results in the present individual pieced together from segments of different colours that correspond to different parental individuals.

(A3) Assign a letter to every site at $t = 0$ (i.e., $\tau = t$). By (A2), this entails that the type for part $\sigma_j$ of $\Sigma'_{t-1}$ is drawn from $\pi_{\sigma_j}.\hat{Z}_0$, independently for every element of the partition. Like the colours, the letters are attached to the sites once and for all, and thus propagated downwards, i.e., down to $\Sigma_0$.

As a consequence of (A2) and (A3), conditional on $\Sigma'_{t-1} = \sigma = \{\sigma_1, \ldots, \sigma_k\}$, the type distribution at present (that is, at forward time $t$) is ${:}(\pi_{\sigma_1}.\hat{Z}_0) \otimes \cdots \otimes (\pi_{\sigma_k}.\hat{Z}_0){:}$, where ${:}\ldots{:}$ means that the factors are ordered as in $X$. Denoting by $S_t$ the type at forward time $t$, we thus have

\[ \mathbb{P}\bigl(\Sigma'_{t-1} = \{\sigma_1, \ldots, \sigma_k\}, S_t = x\bigr) = \mathbb{P}\bigl(\Sigma'_{t-1} = \{\sigma_1, \ldots, \sigma_k\}\bigr) \, {:}(\pi_{\sigma_1}.\hat{Z}_0) \otimes \cdots \otimes (\pi_{\sigma_k}.\hat{Z}_0){:}\,(x). \tag{14} \]

Eq. (14) gives the marginal distribution (of partition and type) for every single individual in a sample, or in the entire population. Due to coalescence events, however, the individuals in a finite population are correlated, and the joint distribution is a difficult matter. (This is investigated within the framework of the ancestral recombination graph, which traces back the genealogy of a sample of individuals; compare Wakeley 2008, Chap. 7.2, Durrett 2008, Chap. 3.4, and Section 4.)



Our goal here is a somewhat simpler one, namely, the distribution of types and ancestries in the $N \to \infty$ limit (under strong recombination). In this limit, the partitioning process simplifies substantially due to the following result.

Lemma 1 Let $\Omega_t$ be the event that $\Sigma_\tau = \Sigma'_{\tau-1}$ for $1 \leq \tau \leq t$; that is, no coalescence occurs until (backward) time $t$, or, equivalently, $\{\Sigma_\tau\}_{0 \leq \tau \leq t}$ is a process of progressive refinements of ordered partitions. For every fixed finite $t$, one has $\mathbb{P}(\Omega_t) \geq 1 - n(n+1)t/(2N) + O(1/N^2)$.

Proof For every $\tau$, in step (S), $\Sigma'_\tau$ is obtained from $\Sigma_\tau$ as a refinement. It is thus clear that $\{\Sigma_\tau\}_{0 \leq \tau \leq t}$ is a process of progressive refinements (and hence of ordered partitions) if and only if $\Sigma_\tau = \Sigma'_{\tau-1}$ for $1 \leq \tau \leq t$. If $\Sigma'_{\tau-1}$ has $k$ parts, then the probability that each is assigned to a different parent in the coalescence step leading to $\Sigma_\tau$ is

\[ q_k := \Bigl(1 - \frac{1}{N}\Bigr)\Bigl(1 - \frac{2}{N}\Bigr) \cdots \Bigl(1 - \frac{k-1}{N}\Bigr). \tag{15} \]

Fig. 3 Construction of a random individual at time $t$, together with its ancestry. Top: partitioning of sites (backward; step (A1)); middle: assignment of colours and letters at the top; bottom: propagation of colours and letters downwards. The middle and bottom panels together correspond to the simultaneous performance of steps (A2) and (A3). In this example, there are no coalescence events, so all partitions are ordered. As a consequence, this realisation of the partitioning process $\{\Sigma_\tau\}_{0 \leq \tau \leq t}$ is a tree and, at the same time, a realisation of the segmentation process $\{F_\tau\}_{0 \leq \tau \leq t}$ of Section 3.2.

Obviously,

\[ q_k \geq q_{n+1} = 1 - \frac{n(n+1)}{2N} + O(1/N^2) \tag{16} \]

because $k \leq |S| = n + 1$. As a consequence,

\[ \mathbb{P}(\Omega_t) \geq (q_{n+1})^t = 1 - \frac{n(n+1)t}{2N} + O(1/N^2) \]

for every fixed finite $t$. □
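The bound of Lemma 1 is easy to evaluate numerically, which gives a feeling for how large $N$ must be before coalescence becomes negligible over $t$ generations. The sketch below uses the hypothetical helper names `q` and `no_coalescence_bound`; it only evaluates the product in (15) and the resulting lower bound $(q_{n+1})^t$.

```python
def q(k, N):
    """q_k from (15): probability that k parts choose k distinct parents
    when each picks one of N parents uniformly with replacement."""
    prob = 1.0
    for i in range(1, k):
        prob *= 1.0 - i / N
    return prob


def no_coalescence_bound(n, t, N):
    """Lower bound (q_{n+1})^t on the probability that no coalescence
    occurs within t generations, for sequences with n+1 sites."""
    return q(n + 1, N) ** t
```

For instance, with $n = 3$ and $t = 5$, the first-order approximation $1 - n(n+1)t/(2N)$ equals $1 - 30/N$, and `no_coalescence_bound(3, 5, N)` differs from it only by terms of order $1/N^2$.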

Note that Lemma 1 implies that, for any finite $t$, coalescence events are absent in the $N \to \infty$ limit and the ancestry is a tree, in line with intuition, and as in Figure 2 (left) and in Figure 3. Note also that Lemma 1 holds for any finite $t$, but not for $t \to \infty$, in the same spirit as the law of large numbers in Prop. 1.



3.2 Segments and the segmentation process 

Since, as we have just seen, we only have to deal with ordered partitions (with probability one for any finite $t$ as $N \to \infty$), we can introduce a simplifying notation for the partitions that is based on links rather than on sites. This is because ordered partitions are in one-to-one correspondence with the subsets of $L$ as follows. As in Baake (2005), let $G = \{\alpha_1, \ldots, \alpha_{|G|}\} \subseteq L$, with $\alpha_1 < \alpha_2 < \cdots < \alpha_{|G|}$, an ordering which we will assume implicitly from now on. Let then $\mathcal{S}(\emptyset) := \{S\}$ and, for $G \neq \emptyset$, let $\mathcal{S}(G) := \{\sigma_1, \sigma_2, \ldots, \sigma_{|G|+1}\}$ denote the ordered partition of $S$ with parts

\[ \sigma_1 := \{0, \ldots, \lfloor\alpha_1\rfloor\}, \ \sigma_2 := \{\lceil\alpha_1\rceil, \ldots, \lfloor\alpha_2\rfloor\}, \ \ldots, \ \sigma_{|G|+1} := \{\lceil\alpha_{|G|}\rceil, \ldots, n\}. \tag{17} \]

In particular, $\mathcal{S}(L) = \{\{0\}, \ldots, \{n\}\}$. It is clear that $\mathcal{S}(H)$ is a refinement of $\mathcal{S}(G)$ if and only if $G \subseteq H$. It is also obvious that $\mathcal{S}$ defines a bijection; its inverse, $\varphi := \mathcal{S}^{-1}$, associates with every ordered partition of $S$ the corresponding subset of $L$, so that $\varphi(\mathcal{S}(G)) = G$ for all $G \subseteq L$.
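The correspondence between subsets of links and ordered partitions in (17) can be spelled out in code. This is a small sketch with hypothetical function names (`ordered_partition` for $\mathcal{S}$ and `links_of` for its inverse); links are kept as half-integers via `fractions.Fraction` to mirror the notation of the text.

```python
from fractions import Fraction
from math import ceil, floor


def ordered_partition(G, n):
    """S(G): cut the sites 0..n at every link in G (links are half-integers)."""
    parts, start = [], 0
    for alpha in sorted(G):
        parts.append(list(range(start, floor(alpha) + 1)))
        start = ceil(alpha)
    parts.append(list(range(start, n + 1)))
    return parts


def links_of(parts):
    """Inverse map: the links that separate consecutive parts."""
    return {Fraction(2 * part[-1] + 1, 2) for part in parts[:-1]}
```

As a usage example, for $n = 5$ and $G = \{\frac{3}{2}, \frac{9}{2}\}$, `ordered_partition({Fraction(3, 2), Fraction(9, 2)}, 5)` gives `[[0, 1], [2, 3, 4], [5]]`, and `links_of` applied to that result returns $G$ again.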

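We now define the associated ordered partitions $\mathcal{L}_G$ of $L \setminus G$. Let $\mathcal{L}_\emptyset = \{L\}$ and, for $G \neq \emptyset$, set (cf. Figure 4):

\[ \tilde{\mathcal{L}}_G := \bigl\{ \{\alpha \in L : \tfrac{1}{2} \leq \alpha < \alpha_1\}, \{\alpha \in L : \alpha_1 < \alpha < \alpha_2\}, \ldots, \{\alpha \in L : \alpha_{|G|} < \alpha \leq \tfrac{2n-1}{2}\} \bigr\}, \qquad \mathcal{L}_G := \tilde{\mathcal{L}}_G \setminus \{\emptyset\}. \tag{18} \]

That is, $\mathcal{L}_G$ is the ordered partition of $L \setminus G$ that holds the segments (in the sense of contiguous sets of links) that arise when recombination has occurred at all links in $G$; in particular, $\mathcal{L}_L = \{\}$.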

Let us now consider the following process of progressive segmentation (which will turn out to coincide with the $N \to \infty$ limit of the partitioning process for any finite time).

Definition 1 (Segmentation process) The segmentation process is the discrete-time Markov chain $\{F_\tau\}_{\tau \in \mathbb{N}_0}$, where $F_\tau$ takes values in the power set of $L$ according to the following rules. Start with $F_0 = \emptyset$ and recall that $\mathcal{L}_\emptyset = \{L\}$. If $F_\tau = G$, choose either none or one link in every segment, according to the following rule. From segment $I$ of $\mathcal{L}_G$, independently of all other segments, either no link is chosen (probability $1 - \sum_{\alpha \in I} \varrho_\alpha$), or a single link is chosen, namely link $\alpha \in I$ with probability $\varrho_\alpha$. Then $F_{\tau+1}$ is the union of $G$ with the set of all newly chosen links. That is,

\[ F_{\tau+1} = F_\tau \cup A, \quad \text{where } A = \bigcup_{I \in \mathcal{L}_{F_\tau}} A_I \tag{19} \]

and

\[ A_I = \begin{cases} \emptyset, & \text{with probability } 1 - \sum_{\alpha \in I} \varrho_\alpha, \\ \{\alpha\}, & \text{with probability } \varrho_\alpha, \text{ for all } \alpha \in I. \end{cases} \tag{20} \]
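Because the state space of the segmentation process is the (finite) power set of $L$, its law at time $t$ can be computed exactly for small examples by propagating probabilities according to (19) and (20). The following is a minimal sketch under stated assumptions: links are represented by the integers $0, \ldots, n-1$, states are `frozenset`s, and the names `segments`, `step_law` and `segmentation_law` are hypothetical.

```python
def segments(n_links, G):
    """Maximal runs of consecutive links not contained in G (the parts of L_G)."""
    segs, current = [], []
    for a in range(n_links):
        if a in G:
            if current:
                segs.append(current)
            current = []
        else:
            current.append(a)
    if current:
        segs.append(current)
    return segs


def step_law(law, rho):
    """Propagate the law of F_tau by one step of (19) and (20)."""
    new = {}
    for G, w in law.items():
        outcomes = [(frozenset(), w)]          # (newly cut links, probability)
        for seg in segments(len(rho), G):
            # each segment independently cuts no link or exactly one link
            options = [(frozenset(), 1.0 - sum(rho[a] for a in seg))]
            options += [(frozenset([a]), rho[a]) for a in seg]
            outcomes = [(A | B, pa * pb)
                        for A, pa in outcomes for B, pb in options]
        for A, pa in outcomes:
            new[G | A] = new.get(G | A, 0.0) + pa
    return new


def segmentation_law(rho, t):
    """Exact distribution of F_t, started from the empty set."""
    law = {frozenset(): 1.0}
    for _ in range(t):
        law = step_law(law, rho)
    return law
```

For two links with probabilities 0.1 and 0.2, both links are cut after two steps with probability $0.1 \cdot 0.2 + 0.2 \cdot 0.1 = 0.04$, since after the first cut the remaining link forms its own segment.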

Fig. 4 The segments induced by $G = \{\frac{3}{2}, \frac{9}{2}\}$ in the case $L = \{\frac{1}{2}, \ldots, \frac{9}{2}\}$ (i.e., $\tilde{\mathcal{L}}_G = \{\{\frac{1}{2}\}, \{\frac{5}{2}, \frac{7}{2}\}, \emptyset\}$, $\mathcal{L}_G = \{\{\frac{1}{2}\}, \{\frac{5}{2}, \frac{7}{2}\}\}$). $\mathcal{L}_G$ corresponds to the ordered partition $\mathcal{S}(G) = \{\{0, 1\}, \{2, 3, 4\}, \{5\}\}$ of $S = \{0, 1, 2, 3, 4, 5\}$.



Clearly, picking a link corresponds to recombination, and $F_\tau$ is the set of links that have been cut until time $\tau$. Note that, as in the Wright-Fisher model with recombination (and its deterministic limit), the links are not, in general, independent: At most one link in a given segment may be cut in one time step; cutting of one link prevents cutting of any other link in the same segment in the same generation. However, the backward point of view adopted here reveals (conditional) independence of the individual segments once they arise. Put differently, links are independent as soon as they are on different segments. This is analogous to the conditional independence of offspring individuals in branching processes and will turn out to be the golden key to the solution.

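The connection of the segmentation process with the partitioning process can now be clarified (we use upper indices once more to denote the dependence on population size):

Proposition 2 Let $t \geq 0$ be arbitrary but fixed. The law of $\{\Sigma_\tau^{(N)}\}_{0 \leq \tau \leq t}$ then agrees with that of $\{F_\tau\}_{0 \leq \tau \leq t}$ up to $O(1/N)$, provided $\{\Sigma_\tau^{(N)} = \sigma\}$ is put in bijective correspondence with $\{F_\tau = \varphi(\sigma)\}$ for all ordered partitions $\sigma$ and $0 \leq \tau \leq t$. For individual time points $\tau \leq t$, this implies specifically that

\[ \mathbb{P}\bigl(\Sigma_\tau^{(N)} = \sigma\bigr) = \begin{cases} \mathbb{P}\bigl(F_\tau = \varphi(\sigma)\bigr) + O(1/N), & \text{if } \sigma \text{ is an ordered partition,} \\ O(1/N), & \text{otherwise} \end{cases} \tag{21} \]

for every $\sigma \in \Pi(S)$, with $\varphi = \mathcal{S}^{-1}$ as defined after (17).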

Proof It is clear that, under the above identification, the initial conditions ($\Sigma_0^{(N)} = \{S\}$ and $F_0 = \varphi(\{S\}) = \emptyset$) agree. It is also clear that, if $\Sigma_\tau^{(N)} = \mathcal{S}(G)$ for some $G \subseteq L$, then $(\Sigma^{(N)})'_\tau$ follows the same law as $F_{\tau+1}$ given $F_\tau = G$ (by step (S) and Def. 1).

Consider now $\{\Sigma_\tau^{(N)}\}_{0 \leq \tau \leq t}$ conditional on $\Omega_t$. By the above observation together with Lemma 1, the law of the conditional process may be understood as follows. Run $\{F_\tau\}_{0 \leq \tau \leq t}$, but in every step $\tau$, kill the process with probability $1 - q_{k+1}$ (from (15)) if $|F_\tau| = k$. The law of the surviving process then is the law of $\{\Sigma_\tau^{(N)} \mid \Omega_t\}_{0 \leq \tau \leq t}$. Since the killing probability up to time $t$ is $1 - \mathbb{P}(\Omega_t) = O(1/N)$ (see Lemma 1), the law of $\{\Sigma_\tau^{(N)} \mid \Omega_t\}_{0 \leq \tau \leq t}$ agrees with that of $\{F_\tau\}_{0 \leq \tau \leq t}$ up to $O(1/N)$. Finally, using $1 - \mathbb{P}(\Omega_t) = O(1/N)$ once more yields the claim.

□

Let us remark that (21) (evaluated at time $\tau - 1$) also entails that

\[ \mathbb{P}\bigl((\Sigma^{(N)})'_{\tau-1} = \sigma\bigr) = \begin{cases} \mathbb{P}\bigl(F_\tau = \varphi(\sigma)\bigr) + O(1/N), & \text{if } \sigma \text{ is an ordered partition,} \\ O(1/N), & \text{otherwise,} \end{cases} \]

since $(\Sigma^{(N)})'_{\tau-1}$ is obtained from $\Sigma_{\tau-1}^{(N)}$ according to the same law as $F_\tau$ from $F_{\tau-1}$, provided $\Sigma_{\tau-1}^{(N)}$ is an ordered partition. Let us also remark that, for finite $N$, $\{\Sigma_\tau^{(N)} \mid \Omega_t\}_{0 \leq \tau \leq t}$ only agrees with $\{F_\tau\}_{0 \leq \tau \leq t}$ up to a probability of $O(1/N)$. This is due to a bias, in a finite population, towards partitions with fewer parts, since these bear less risk of coalescence.

Before we proceed, let us note another elementary, but crucial property of the segmentation process.
Consider the segmentation process $\{F_\tau^{(L')}\}_{\tau \in \mathbb{N}_0}$ on a contiguous subset $L'$ of $L$. Here $\{F_\tau^{(L')}\}_{\tau \in \mathbb{N}_0}$
is defined in the same way as $\{F_\tau\}_{\tau \in \mathbb{N}_0}$ but with $L$ replaced by $L'$, and based on the recombination
probabilities $\rho_\alpha$, $\alpha \in L'$, alone. Here, the upper index now indicates dependence on the (sub-)set of
links, which we may omit if $L' = L$; that is, $F_\tau^{(L)} = F_\tau$. Likewise, we will denote by $\mathcal{L}_G^{(L')}$, $G \subseteq L'$, the
partition of $L' \setminus G$ defined in analogy with (18), with $L$ replaced by $L'$. We then have the following
important fact.

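Proposition 3 (marginalisation property) Let $L'$ be a contiguous subset of $L$. The process $\{F_\tau^{(L')}\}_{\tau \in \mathbb{N}_0}$
then is the marginal version of $\{F_\tau^{(L)}\}_{\tau \in \mathbb{N}_0}$ with respect to the links in $L'$, that is, for all $G \subseteq L'$ and all
$\tau \in \mathbb{N}_0$, we have

\[
P(F_\tau^{(L')} = G) = \sum_{H \subseteq L \setminus L'} P(F_\tau^{(L)} = G \cup H).
\]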

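Proof We will prove the claim by showing that, for every $G \subseteq L'$ and $H \subseteq L \setminus L'$, the set $A^{(L')}$ of links
picked from the segments of $\mathcal{L}_G^{(L')}$ if $F_\tau^{(L')} = G$ follows the same law as $A^{(L)} \cap L'$, where $A^{(L)}$ is the set of
links picked from the segments of $\mathcal{L}_{G \cup H}^{(L)}$ if $F_\tau^{(L)} = G \cup H$ (according to (19) and (20), and likewise for
$L'$). Due to the independence of the segments (within $\mathcal{L}_G^{(L')}$ and within $\mathcal{L}_{G \cup H}^{(L)}$), we may consider these
segments separately. Segments that are contained in both $\mathcal{L}_G^{(L')}$ and $\mathcal{L}_{G \cup H}^{(L)}$ contribute identically to
$A^{(L')}$ and $A^{(L)} \cap L'$ by construction. Segments $I \in \mathcal{L}_{G \cup H}^{(L)}$ with $I \cap L' = \varnothing$ do not contribute to $A^{(L)} \cap L'$
and are independent of those in $\mathcal{L}_G^{(L')}$ and thus of $A^{(L')}$. We are left to consider segments $I' \in \mathcal{L}_G^{(L')}$
with $I' \subsetneq I$ for some $I \in \mathcal{L}_{G \cup H}^{(L)}$. But here the probability to pick any $\alpha \in I'$ (for $A^{(L')}$) is the same as
picking this $\alpha$ from $I$ for $A^{(L)} \cap L'$, namely, $\rho_\alpha$, which completes the proof. □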
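The marginalisation property can be checked numerically on a small instance. The following is a minimal sketch, not taken from the paper: links are encoded as integers (key i standing for the link (2i - 1)/2), the recombination probabilities in `RHO` are assumed values, and the helper names `split`, `step` and `law` are hypothetical. The law of the segmentation process is propagated exactly with rational arithmetic, and the process on L = {1, 2, 3} is marginalised onto L' = {1, 2}.

```python
from fractions import Fraction
from itertools import product

# Assumed recombination probabilities; key i stands for the link (2i - 1)/2.
RHO = {1: Fraction(1, 10), 2: Fraction(1, 5), 3: Fraction(1, 4)}

def split(links, cut):
    """Segments of `links` (sorted) that remain after cutting the links in `cut`."""
    parts, current = [], []
    for x in links:
        if x in cut:
            parts.append(current)
            current = []
        else:
            current.append(x)
    parts.append(current)
    return [p for p in parts if p]

def step(dist, links):
    """One time step: every segment independently picks at most one of its links."""
    new = {}
    for G, p in dist.items():
        options = []
        for seg in split(links, G):
            none = 1 - sum(RHO[x] for x in seg)
            options.append([(None, none)] + [(x, RHO[x]) for x in seg])
        for choice in product(*options):
            q, cut = p, set(G)
            for link, prob in choice:
                q *= prob
                if link is not None:
                    cut.add(link)
            key = tuple(sorted(cut))
            new[key] = new.get(key, 0) + q
    return new

def law(links, tau):
    """Exact distribution of the segmentation process on `links` at time tau."""
    dist = {(): Fraction(1)}
    for _ in range(tau):
        dist = step(dist, links)
    return dist

full, sub, tau = (1, 2, 3), (1, 2), 3
big, small = law(full, tau), law(sub, tau)

# Sum the law on L over all H in L \ L', as in Proposition 3.
marginal = {}
for G, p in big.items():
    key = tuple(x for x in G if x in sub)
    marginal[key] = marginal.get(key, 0) + p

print(marginal == small)
```

Exact fractions are used so that the comparison is an equality test rather than a floating-point tolerance check.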

We are now ready to state the main result of this section. 

Theorem 2 (type distribution via ancestral process) Consider a sequence of Wright-Fisher models
with single-crossover recombination and increasing population size $N$. Let the initial states be such that
$\lim_{N \to \infty} Z_0^{(N)} = p_0$. The type distribution of any given individual for any finite $t \in \mathbb{N}_0$ converges to
$\sum_{G \subseteq L} P(F_t = G) R_G(p_0)$ as $N \to \infty$. For the composition of the population, we have

\[
\lim_{N \to \infty} Z_t^{(N)} = \sum_{G \subseteq L} P(F_t = G)\, R_G(p_0) \quad \text{in mean square.} \tag{23}
\]

Clearly, (23) is again a law of large numbers, analogous to the infinite-population limit of Prop. 1,
but this time expressed in terms of the backward process; the connection will be exploited in the next
section. As in Remark 1, the result again carries over to finite time intervals.

Proof (of Theorem 2) We prove the theorem by considering the joint distribution of partitions and
types, $\Sigma'_{t-1}$ and $\Xi_t$ as in (14). We will omit the dependence on $N$ for ease of notation. For a given
single individual, we first rewrite the joint probabilities as

\[
P(\Sigma'_{t-1} = \sigma, \Xi_t = x) = P(\Sigma'_{t-1} = \sigma)\, P(\Xi_t = x \mid \Sigma'_{t-1} = \sigma). \tag{24}
\]

As to the first term on the right-hand side, recall that, by (22), the only partitions that survive in
the limit are $\mathcal{S}(G)$, $G \subseteq L$, for which $P(\Sigma'_{t-1} = \mathcal{S}(G)) = P(F_t = G) + O(1/N)$. As to the second term,
(11) tells us that the type distribution corresponding to $\mathcal{S}(G)$ is

\[
(\pi_{\sigma_1}.Z_0) \otimes \cdots \otimes (\pi_{\sigma_{|G|+1}}.Z_0) = R_G(Z_0)
\]

with $\sigma_1, \ldots, \sigma_{|G|+1}$ of (17). But $R_G(Z_0)$ converges to $R_G(p_0)$ by assumption. By Lemma 1, in the
limit $N \to \infty$, (24) thus becomes

\[
\lim_{N \to \infty} P(\Sigma'_{t-1} = \sigma, \Xi_t = x) =
\begin{cases}
P(F_t = G)\,(R_G(p_0))(x), & \text{if } \sigma = \mathcal{S}(G) \text{ for some } G \subseteq L, \\
0, & \text{otherwise},
\end{cases}
\tag{25}
\]

from which the type distribution for single individuals follows by marginalisation over $\Sigma'_{t-1}$.

As to the population, let individuals be numbered $1, 2, \ldots, N$, and let $\Sigma'_{t-1,i}$ and $\Xi_{t,i}$ be the
partition at (backward) time $t-1$ (after splitting), and the type at (forward) time $t$, respectively, of
individual $i$, $1 \leq i \leq N$ (so that the above $\Sigma'_{t-1}$ and $\Xi_t$ may be identified with $\Sigma'_{t-1,1}$ and $\Xi_{t,1}$).
The $\Sigma'_{t-1,i}$ are identically distributed (across $i$), but not independent (they are correlated due to
common ancestry); the same holds for the $\Xi_{t,i}$. Clearly,

\[
Z_t(x) = \frac{1}{N} \sum_{i=1}^{N} 1\{\Xi_{t,i} = x\},
\]

where $1\{\ldots\}$ denotes the indicator function for the event in question. We will show that, for all $x \in X$
and $\sigma \in \Pi(S)$,

\[
\lim_{N \to \infty} \Big( \frac{1}{N} \sum_{i=1}^{N} 1\{\Sigma'_{t-1,i} = \sigma, \Xi_{t,i} = x\} - P(\Sigma'_{t-1,1} = \sigma, \Xi_{t,1} = x) \Big) = 0 \tag{26}
\]

in mean square. Eq. (23) then follows via summation over all $\sigma \in \Pi(S)$, together with the result for
single individuals, which tells us that $P(\Xi_{t,i} = x) \to \sum_{G \subseteq L} P(F_t = G)\,(R_G(p_0))(x)$ as $N \to \infty$.

To establish (26), it is sufficient to show that the covariance of $1\{\Sigma'_{t-1,1} = \sigma, \Xi_{t,1} = x\}$ and
$1\{\Sigma'_{t-1,2} = \sigma, \Xi_{t,2} = x\}$ is $O(1/N)$; due to exchangeability, this then carries over to arbitrary pairs
of individuals. Let $\tilde\Omega_t$ be the event that no coalescence happens between ancestors of individual 1
and those of individual 2 until time $t$ (while coalescences between ancestors of the same individual
are allowed). In analogy with (16), the probability of no such coalescence in a single time step is
bounded from below by

\[
\tilde\omega := \Big(1 - \frac{n+1}{N}\Big)^{n+1} = 1 - \frac{(n+1)^2}{N} + O(1/N^2) \tag{27}
\]

since each individual has at most $|S| = n + 1$ ancestors. Thus

\[
P(\tilde\Omega_t) \geq \tilde\omega^{\,t} = 1 - \frac{(n+1)^2\, t}{N} + O(1/N^2)
\]

for every finite $t$ as $N \to \infty$. We now consider $\{\Sigma'_{\tau,1}, \Sigma'_{\tau,2} \mid \tilde\Omega_{t-1}\}_{0 \leq \tau \leq t-1}$. Arguing in a similar way
as in the proof of Prop. 2, we find that the law of the conditional joint process agrees with that of
two independent copies of $\{\Sigma'_\tau\}_{0 \leq \tau \leq t-1}$ as long as there is no common ancestry between parts of $\Sigma'_{\tau,1}$
and those of $\Sigma'_{\tau,2}$. This is the case with probability $P(\tilde\Omega_{t-1})$, which deviates from 1 by $O(1/N)$, so
that

\[
P(\Sigma'_{t-1,1} = \sigma, \Sigma'_{t-1,2} = \sigma \mid \tilde\Omega_{t-1}) = \big(P(\Sigma'_{t-1,1} = \sigma)\big)^2 + O(1/N). \tag{28}
\]

Next, on $\tilde\Omega_{t-1}$, the parts of $\Sigma'_{t-1,1}$ and of $\Sigma'_{t-1,2}$ pick their types independently, so that

\[
P(\Xi_{t,1} = x, \Xi_{t,2} = x \mid \Sigma'_{t-1,1} = \sigma, \Sigma'_{t-1,2} = \sigma, \tilde\Omega_{t-1}) = \big(P(\Xi_{t,1} = x \mid \Sigma'_{t-1,1} = \sigma)\big)^2. \tag{29}
\]
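Taking together $1 - P(\tilde\Omega_{t-1}) = O(1/N)$, (28) and (29), we get

\begin{align*}
P(\Sigma'_{t-1,1} &= \sigma, \Sigma'_{t-1,2} = \sigma, \Xi_{t,1} = x, \Xi_{t,2} = x) \\
&= P(\Sigma'_{t-1,1} = \sigma, \Sigma'_{t-1,2} = \sigma, \Xi_{t,1} = x, \Xi_{t,2} = x \mid \tilde\Omega_{t-1}) + O(1/N) \\
&= P(\Xi_{t,1} = x, \Xi_{t,2} = x \mid \Sigma'_{t-1,1} = \sigma, \Sigma'_{t-1,2} = \sigma, \tilde\Omega_{t-1}) \\
&\qquad \times P(\Sigma'_{t-1,1} = \sigma, \Sigma'_{t-1,2} = \sigma \mid \tilde\Omega_{t-1}) + O(1/N) \\
&= \big(P(\Xi_{t,1} = x, \Sigma'_{t-1,1} = \sigma)\big)^2 + O(1/N),
\end{align*}

so that indeed

\[
\mathrm{Cov}\big(1\{\Sigma'_{t-1,1} = \sigma, \Xi_{t,1} = x\},\, 1\{\Sigma'_{t-1,2} = \sigma, \Xi_{t,2} = x\}\big) = O(1/N),
\]

which establishes (26) and proves the claim. □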



3.3 Connection with the deterministic dynamical system 
A main result now is 

Theorem 3 For all $G \subseteq L$ and all $\tau \geq 0$, we have

\[
P(F_\tau = G) = a_G(\tau).
\]

We give two proofs that result in different and mutually complementary insight. The first uses a 
general argument, the second a concrete calculation. 

Proof (First proof of Theorem 3) Compare the two laws of large numbers, Prop. 1 and Theorem 2. The
claim is obvious via comparison of coefficients. The latter is justified by the following observation (cf.
the argument in the proof of Theorem 3 in von Wangenheim et al. 2010). For generic $p_0$ and generic
$X_i$, the vectors $R_G(p_0)$ with $G \subseteq L$ are the extremal vectors of the closed simplex $\mathrm{conv}\{R_K(p_0) \mid K \subseteq
L\}$, where conv denotes the convex hull. They are the vectors that (generically) cannot be expressed
as a non-trivial convex combination within the simplex, and hence the vertices of the simplex (in cases
with degeneracies, one reduces the simplex in the obvious way). □

Proof (Second proof of Theorem 3) We simply show that $a_G(\tau)$ and $P(F_\tau = G)$ follow the same
recursions (with the same initial values). To this end, recall that we have implied

\[
P(F_\tau = G) = P(F_\tau = G \mid F_0 = \varnothing).
\]

We then decompose the $\tau + 1$ time steps into the initial step, followed by an interval of $\tau$ steps (in
the spirit of the Kolmogorov backward equation) to obtain

\begin{align*}
P(F_{\tau+1}^{(L)} = G) &= P(F_{\tau+1}^{(L)} = G \mid F_0^{(L)} = \varnothing) = \sum_{H \subseteq G} P(F_1^{(L)} = H \mid F_0^{(L)} = \varnothing)\, P(F_{\tau+1}^{(L)} = G \mid F_1^{(L)} = H) \\
&= P(F_1^{(L)} = \varnothing \mid F_0^{(L)} = \varnothing)\, P(F_{\tau+1}^{(L)} = G \mid F_1^{(L)} = \varnothing) \\
&\quad + \sum_{\alpha \in G} P(F_1^{(L)} = \{\alpha\} \mid F_0^{(L)} = \varnothing)\, P(F_{\tau+1}^{(L)} = G \mid F_1^{(L)} = \{\alpha\}) \\
&= \Big(1 - \sum_{\alpha \in L} \rho_\alpha\Big) P(F_\tau^{(L)} = G) + \sum_{\alpha \in G} \rho_\alpha\, P(F_\tau^{(L_{<\alpha})} = G_{<\alpha})\, P(F_\tau^{(L_{>\alpha})} = G_{>\alpha}). \tag{30}
\end{align*}

In the last step, we have used $P(F_{\tau+1}^{(L)} = G \mid F_1^{(L)} = \{\alpha\}) = P(F_\tau^{(L_{<\alpha})} = G_{<\alpha})\, P(F_\tau^{(L_{>\alpha})} = G_{>\alpha})$,
which is due to the conditional independence of segments according to Def. 1. Knowing Prop. 3 (for
$L' = L_{<\alpha}$ and hence $L \setminus L' = L_{>\alpha}$), the recursion (30) is identical to the one in (13); together with
the identity of the initial conditions,

\[
a_G^{(L)}(0) = P(F_0^{(L)} = G) = \delta_{G,\varnothing},
\]

this proves the claim. □

It is important to note that the second proof does not rely on Theorem 2. Theorem 3 could thus
be used to establish Theorem 2 by simply invoking Prop. 1 and the deterministic time evolution
(12). However, the independent proof of Theorem 2 bears the great advantage that it only requires
the stochastic arguments derived in the current paper, thus making the argument self-contained and 
independent of the knowledge of the deterministic dynamics developed in previous work via quite a 
different toolbox. 
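The recursion (30) translates directly into a memoised function. The sketch below is an illustration under assumptions rather than the authors' implementation: links are encoded as integers (key i for the link (2i - 1)/2), the values in `RHO` are made up, and the function name `a` simply mirrors the coefficient $a_G^{(L)}(\tau)$. Two simple consequences of the recursion are printed: the coefficients sum to one over all G, and the coefficient of the empty set is the probability that nothing is ever cut.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import combinations

# Assumed recombination probabilities; key i stands for the link (2i - 1)/2.
RHO = {1: Fraction(1, 10), 2: Fraction(1, 5), 3: Fraction(1, 4)}

@lru_cache(maxsize=None)
def a(links, G, tau):
    """a_G^{(links)}(tau) via recursion (30); `links` and `G` are sorted tuples."""
    if tau == 0:
        return Fraction(int(len(G) == 0))      # initial condition delta_{G, empty}
    stay = 1 - sum(RHO[x] for x in links)      # no link of the root segment is cut
    total = stay * a(links, G, tau - 1)
    for alpha in G:                            # first cut happens at link alpha
        left = tuple(x for x in links if x < alpha)
        right = tuple(x for x in links if x > alpha)
        g_left = tuple(x for x in G if x < alpha)
        g_right = tuple(x for x in G if x > alpha)
        total += RHO[alpha] * a(left, g_left, tau - 1) * a(right, g_right, tau - 1)
    return total

L = (1, 2, 3)
subsets = [c for r in range(len(L) + 1) for c in combinations(L, r)]
tau = 5
print(sum(a(L, G, tau) for G in subsets))                      # 1
print(a(L, (), tau) == (1 - sum(RHO.values())) ** tau)         # True
print(a(L, (1, 2), 2))                                         # 7/200
```

The value 7/200 for G = {1/2, 3/2} at $\tau = 2$ can be confirmed by hand: either link 1/2 is cut first and then 3/2 (probability 1/10 times 1/5), or 3/2 first and then 1/2 while the segment {5/2} stays intact (1/5 times 1/10 times 3/4).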



3.4 Towards an explicit solution - preparation and example 



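Before we proceed, let us consider an important aspect of the segmentation process, namely, the
probability that nothing happens in one time step given the current state is $G$. For any contiguous
$L' \subseteq L$, let

\[
\lambda_G^{(L')} := P(F_{\tau+1}^{(L')} = G \mid F_\tau^{(L')} = G) = \prod_{I \in \mathcal{L}_G^{(L')}} \Big(1 - \sum_{\alpha \in I} \rho_\alpha\Big) \quad \text{for } G \subseteq L'. \tag{31}
\]

As before, we will omit the upper index in the case of $L' = L$, that is, $\lambda_G^{(L)} = \lambda_G$. Since for every
$I \in \mathcal{L}_G^{(L')}$ one has $\mathcal{L}_\varnothing^{(I)} = \{I\}$, and thus $\lambda_\varnothing^{(I)} = 1 - \sum_{\alpha \in I} \rho_\alpha$, we can rewrite (31) as

\[
\lambda_G^{(L')} = \prod_{I \in \mathcal{L}_G^{(L')}} \lambda_\varnothing^{(I)}. \tag{32}
\]

The coefficients $\lambda_G$ have already been identified by Bennett (1954), Lyubich (1992), and Dawson
(2000, 2002), as well as by von Wangenheim et al. (2010) as the generalised eigenvalues of the
linearised deterministic dynamics.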
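As a small illustration of (31) and (32), the following sketch computes the probability that nothing happens in one step. The integer encoding of links (key i for the link (2i - 1)/2), the numerical values in `rho`, and the helper names `segments` and `lam` are assumptions made for this example only.

```python
from fractions import Fraction

def segments(links, cut):
    """Non-empty segments of the sorted list `links` after cutting at `cut`."""
    parts, current = [], []
    for a in links:
        if a in cut:
            if current:
                parts.append(current)
            current = []
        else:
            current.append(a)
    if current:
        parts.append(current)
    return parts

def lam(links, cut, rho):
    """Product over segments of (1 - sum of rho on the segment), cf. (31)."""
    result = Fraction(1)
    for seg in segments(links, cut):
        result *= 1 - sum(rho[a] for a in seg)
    return result

rho = {1: Fraction(1, 10), 2: Fraction(1, 5), 3: Fraction(1, 4)}  # assumed values
L = [1, 2, 3]
print(lam(L, set(), rho))   # 9/20
print(lam(L, {2}, rho))     # 27/40

# Product form (32): the value for G is the product of the empty-set values per segment.
print(lam(L, {2}, rho) == lam([1], set(), rho) * lam([3], set(), rho))  # True
```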

Our aim now is to find a closed-form expression for $P(F_\tau = G)$ for all $\tau$. Clearly, $P(F_\tau = G)$ is the
sum over the probabilities for all paths that lead to the state $F_\tau = G$. Each path of the segmentation
process may be represented by a tree, which we will call ancestral recombination tree or ART for short; 
we thus have to sum over the corresponding trees. Considering the trees carefully will be the key to 
the solution. Let us illustrate this by means of an example. 

Example 1 For four sites $S = \{0,1,2,3\}$ with the corresponding links $L = \{\tfrac12, \tfrac32, \tfrac52\}$, we consider
$P(F_\tau = \{\tfrac12, \tfrac32\})$, as illustrated in Figure 5. That is, we are concerned with all paths of the segmentation
process that lead to $F_\tau = \{\tfrac12, \tfrac32\}$. The left tree captures the path where link $\tfrac12$ is the first to be cut,
the right one the path where link $\tfrac32$ is cut first. The $\lambda_G^{\,j}$ are the probabilities that nothing happens to any of
the current segments for $j$ time steps. In the case where $\tfrac32$ is the first event, the additional factor
$\lambda_\varnothing^{(\{5/2\})} = 1 - \rho_{5/2}$ is required to guarantee that at the time of the second segmentation event (at link
$\tfrac12$), the segment that belongs to link $\tfrac52$ remains unchanged (the corresponding term in the other case
is $\lambda_\varnothing^{(\varnothing)} = 1$). Finally, summing over all possible time combinations, one obtains

\begin{align*}
P(F_\tau = \{\tfrac12, \tfrac32\}) &= \sum_{k=0}^{\tau-2} \sum_{i=0}^{k} \lambda_\varnothing^{\,i}\, \rho_{1/2}\, \lambda_{\{1/2\}}^{\,k-i}\, \rho_{3/2}\, \lambda_{\{1/2,3/2\}}^{\,\tau-2-k} \\
&\quad + \sum_{k=0}^{\tau-2} \sum_{i=0}^{k} \lambda_\varnothing^{\,i}\, \rho_{3/2}\, \lambda_{\{3/2\}}^{\,k-i}\, \rho_{1/2}\, \lambda_\varnothing^{(\{5/2\})}\, \lambda_{\{1/2,3/2\}}^{\,\tau-2-k}, \tag{33}
\end{align*}

where the first double sum belongs to the left, the second to the right tree of Figure 5. Unsurprisingly,
the same result is obtained by explicitly solving (13) (by the method established by
von Wangenheim et al. 2010), which demonstrates once more that $P(F_\tau = \{\tfrac12, \tfrac32\}) = a_{\{1/2,3/2\}}(\tau)$,
in line with Theorem 3.


Fig. 5 The two possible paths of the segmentation process of Example 1 that lead to $F_\tau = \{\tfrac12, \tfrac32\}$. The left panel
refers to the first double sum of (33), the right one to the second.



$P(F_\tau = G)$ may be understood as a sum over both tree topologies and branch lengths, i.e. we
are concerned with all possible ultrametric binary trees that can be produced by the segmentation
process. (The trees will be explained in more detail later. For the moment, recall that, in a binary tree,
each internal node has at most two offspring nodes. An ultrametric tree is a tree whose branches
are assigned lengths such that all leaves have the same distance from the root. For a review of
metric trees, see Semple and Steel 2003, Chap. 7.) In our case, the branch length corresponds to
the number of time steps between consecutive nodes, and each internal node with two offspring
nodes corresponds to a recombination event. We will now show that (and how) it is sufficient to deal
with the corresponding tree topologies instead, which are obtained by contracting consecutive edges
connected by a node with a single offspring into a single edge and removing the branch lengths. The
result of this (many-to-one) operation is the topology of a full binary tree, that is, every internal node
has exactly two offspring nodes (which may be internal nodes or leaves). The probability for each
topology then is the sum of all probabilities of all the underlying original (ultrametric) trees, that
is, the probability for all possible combinations of branch lengths (cf. the double sums in (33)). It
will turn out that these sums may be evaluated explicitly for each topology, which is the reason that
this approach is useful. For Example 1, this will (after a simple but lengthy calculation) result in

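\begin{align*}
P(F_\tau = \{\tfrac12, \tfrac32\}) &= P(\text{tree 1}) + P(\text{tree 2}) \\
&= \frac{\rho_{1/2}\,\rho_{3/2}}{\lambda_{\{1/2,3/2\}} - \lambda_{\{1/2\}}} \Big( \frac{\lambda_{\{1/2,3/2\}}^{\,\tau} - \lambda_\varnothing^{\,\tau}}{\lambda_{\{1/2,3/2\}} - \lambda_\varnothing} - \frac{\lambda_{\{1/2\}}^{\,\tau} - \lambda_\varnothing^{\,\tau}}{\lambda_{\{1/2\}} - \lambda_\varnothing} \Big) \tag{34} \\
&\quad + \frac{\rho_{3/2}\,\rho_{1/2}\,\lambda_\varnothing^{(\{5/2\})}}{\lambda_{\{1/2,3/2\}} - \lambda_{\{3/2\}}} \Big( \frac{\lambda_{\{1/2,3/2\}}^{\,\tau} - \lambda_\varnothing^{\,\tau}}{\lambda_{\{1/2,3/2\}} - \lambda_\varnothing} - \frac{\lambda_{\{3/2\}}^{\,\tau} - \lambda_\varnothing^{\,\tau}}{\lambda_{\{3/2\}} - \lambda_\varnothing} \Big),
\end{align*}

where tree 1 refers to the left and tree 2 to the right panel in Figure 5.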
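The step from the double sums over branch lengths to a closed form per topology can be checked numerically. The sketch below is an illustration with assumed recombination probabilities; the function names are hypothetical, and the closed form coded in `closed` is the result of summing the two geometric series in each double sum, written out here so that the block is self-contained.

```python
from fractions import Fraction

# Assumed recombination probabilities for the links 1/2, 3/2, 5/2.
r1, r2, r3 = Fraction(1, 10), Fraction(1, 5), Fraction(1, 4)

lam_0 = 1 - r1 - r2 - r3        # nothing cut: single segment {1/2, 3/2, 5/2}
lam_1 = 1 - r2 - r3             # 1/2 cut: remaining segment {3/2, 5/2}
lam_2 = (1 - r1) * (1 - r3)     # 3/2 cut: segments {1/2} and {5/2}
lam_12 = 1 - r3                 # 1/2 and 3/2 cut: remaining segment {5/2}

def double_sum(tau, lam_mid, weight):
    """Sum over branch lengths: i steps before the first cut, k - i between the cuts."""
    return sum(lam_0**i * lam_mid**(k - i) * lam_12**(tau - 2 - k) * weight
               for k in range(tau - 1) for i in range(k + 1))

def closed(tau, lam_mid, weight):
    """The same quantity after carrying out both geometric sums."""
    return weight / (lam_12 - lam_mid) * (
        (lam_12**tau - lam_0**tau) / (lam_12 - lam_0)
        - (lam_mid**tau - lam_0**tau) / (lam_mid - lam_0))

def prob(tau, method):
    tree1 = method(tau, lam_1, r1 * r2)              # 1/2 first, then 3/2
    tree2 = method(tau, lam_2, r2 * r1 * (1 - r3))   # 3/2 first; {5/2} must stay intact
    return tree1 + tree2

print(prob(2, double_sum))                                            # 7/200
print(all(prob(t, double_sum) == prob(t, closed) for t in range(8)))  # True
```

The closed form needs only a fixed number of operations per topology, whereas the double sum grows quadratically in $\tau$; this is the practical content of contracting the branch lengths.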



3.5 Ancestral tree topologies 



Our aim now is to assign probabilities to each of the possible topologies that have the elements of
a given set $G$ as their internal nodes. Once the probabilities are known, $P(F_\tau = G)$ is obtained by
summing over all compatible topologies. Let us begin with a suitable definition for our tree topologies
(see Figure 6 for an illustration).

Definition 2 For $\varnothing \neq G \subseteq L$, a tree topology is defined as $T := (G, m)$, where $G$ signifies the set of
internal nodes, $\gamma \in G$ designates the initial branching point of the tree, and, in addition, $r$ is the root.
The function $m$ is given by

\[
m : G \to G \cup \{r\}, \qquad \alpha \mapsto m(\alpha),
\]

and $m(\alpha)$ denotes the (unique) ancestor of the internal node $\alpha \in G$. $m(\alpha)$ is an internal node except
for $\alpha = \gamma$, where $m(\gamma) = r$. We will assume throughout that $m$ is tree-consistent, that is, the resulting
structure is a full binary tree topology. For $G = \varnothing$, the only tree topology is the empty tree (with
no internal nodes).

Thus, $T$ has the internal nodes $\alpha \in G$ and the set of internal edges

\[
\{(m(\alpha), \alpha) \mid \alpha \in G,\ m(\alpha) \neq r\},
\]

as well as the external edge $(r, \gamma)$. Note that we have not included the (external) leaves (and
corresponding external edges) in our definition since they will never be required explicitly. Note also that,
in the context of phylogeny, the (canonical) root of the tree is what we call initial branching point,
cf. Semple and Steel (2003). For an excellent account of terminology and properties of trees, see
Gross and Yellen (1999, Chap. 3).

We will use the standard partial order on $T$, namely, $\alpha \preceq \beta$ means that $\alpha$ is on the path from $\gamma$
to $\beta$, i.e. $\alpha = m^i(\beta)$ for some $i \in \{0, \ldots, |G|\}$. Obviously, $\gamma$ is the minimal element of $G$ with respect
to $\preceq$. Furthermore, $\alpha \prec \beta$ means that $\alpha \preceq \beta$ with $\alpha \neq \beta$.

Note that the topology $T$ as such does not depend on $L$ except via the requirement $G \subseteq L$; it may
likewise represent a realisation of a process $\{F_s^{(L')}\}_{0 \leq s \leq \tau}$, restricted to a (contiguous) subset $L'$ of $L$,
provided $G \subseteq L'$. If we also specify the set of links, say $L'$, then each edge of a given tree topology
can be associated with a particular segment. Namely, for $T = (G, m)$ and $\alpha \in G$, we associate with
the edge $(m(\alpha), \alpha)$ the segment

\[
I_\alpha^{(L')}(T) := K \in \mathcal{L}^{(L')}_{\{\beta \in G \,\mid\, \beta \prec \alpha\}} \quad \text{s.t. } \alpha \in K.
\]

In words, $I_\alpha^{(L')}(T)$ is the segment that will receive its next cut at link $\alpha$ (given the topology $T$). In
particular, $I_\gamma^{(L')}(T) = L'$ (independently of $T$). An example is given in Figure 6. From now on we will
suppress the dependence on $T$ and $L'$ throughout and write $I_\alpha$ instead of $I_\alpha^{(L')}(T)$. Next, we define
subtrees.



Definition 3 (Subtrees and subtree decomposition) Consider $T = (G, m)$ with $\varnothing \neq G \subseteq L$. Then,
for any $\gamma \in H \subseteq G$ and $\alpha \in G$, a subtree of $T$ is defined via $T_\alpha(H) = (G_\alpha(H), m|_{G_\alpha(H)})$, where

\[
G_\alpha(H) := \{\beta \in G \mid \alpha \preceq \beta \text{ and } h \not\preceq \beta \ \ \forall h \in H \text{ with } \alpha \prec h\},
\]

and $m|_{G_\alpha(H)}$ is the restriction of $m$ to $G_\alpha(H)$. Specifically, we set $m|_{G_\alpha(H)}(\alpha) =: r_\alpha$ for the initial
branching point of the respective subtree (so that $r = r_\gamma$ for consistency). The collection $\{T_\beta(H)\}_{\beta \in H}$
describes a decomposition of $T$ into subtrees, where $T_\beta(H)$ has initial branching point $\beta$ and internal
nodes $G_\beta(H)$.
Fig. 6 Example of a tree topology $T$ for $G = \{\alpha_1, \ldots, \alpha_5\}$, with root $r$ and initial branching point $\gamma = \alpha_2$.
Each (internal and root) edge is identified with a certain segment. Here, we have $I_{\alpha_1} = L_{<\gamma}$, $I_{\alpha_2} = I_\gamma = L$,
$I_{\alpha_3} = \{\gamma + 1, \ldots, \alpha_3, \ldots, \alpha_4 - 1\}$, $I_{\alpha_4} = L_{>\gamma}$ and $I_{\alpha_5} = L_{>\alpha_4}$.



Intuitively, the decomposition is obtained by 'cutting the tree below each element of $H$'. The
tree then disintegrates into the subtrees $\{T_\beta(H)\}_{\beta \in H}$, and each element of $H$ appears as the initial
branching point of one of the subtrees; Figure 7 provides an example. The $T_\alpha(H)$, $\alpha \in G \setminus H$, are, in
turn, subtrees of these subtrees; they will also be required in what follows.

Obviously, $G_\alpha(H)$ depends on the topology $T$ (via the partial order), but again we omit this
for economy of notation. Note that the subtrees inherit the segments from the original tree, i.e.,
$I_\beta(T_\alpha(H)) = I_\beta(T) = I_\beta$. Let us mention that similar subtree decompositions appear in the context
of molecular phylogeny, for example, Tuffley's poset (Gill et al. 2008).




Fig. 7 Decomposition of a tree topology into three subtrees via $H = \{\gamma, \alpha_4, \alpha_5\}$. The subtrees are labelled with their
initial branching points, so the decomposition consists of $T_\gamma(H)$, $T_{\alpha_4}(H)$, and $T_{\alpha_5}(H)$, with node sets $G_\gamma(H) =
\{\alpha_1, \gamma\}$, $G_{\alpha_4}(H) = \{\alpha_3, \alpha_4\}$, and $G_{\alpha_5}(H) = \{\alpha_5\}$, respectively.



3.6 ART probabilities and explicit solution of the segmentation process 

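Let us now assign probabilities to tree topologies. To this end, consider the augmented segmentation
process $\{\mathcal{F}_\tau\}_{\tau \in \mathbb{N}_0}$ with values in the set of all possible tree topologies $T = (G, m)$ (rather than the
sets $G$ alone); $\mathcal{F}_\tau = (G, m)$ means that $F_\tau = G$, and the segmentation events have occurred according
to the partial order implied by $m$ (as in (34) of Example 1). We will abbreviate $P(\mathcal{F}_\tau = T)$ as $P_\tau(T)$.
Let us now state the central result for these tree probabilities.

Theorem 4 (ART probabilities) Under the segmentation process, the probability for the tree topology
$T = (G, m)$ at time $\tau$ is given by $P_\tau(T) = (\lambda_\varnothing)^\tau$ for $G = \varnothing$, and, for $\varnothing \neq G \subseteq L$ and initial branching
point $\gamma \in G$, by

\[
P_\tau(T) = \sum_{\gamma \in H \subseteq G} (-1)^{|H|-1} \Big[ \big(\lambda_{G_\gamma(H)}\big)^\tau - \big(\lambda_\varnothing\big)^\tau \Big] f(T, H). \tag{35}
\]

Here, for $H \subseteq G$,

\[
f(T, H) := \prod_{\alpha \in G} g(T_\alpha(H)) \tag{36}
\]

with

\[
g(T_\alpha(H)) := \frac{\rho_\alpha}{\lambda^{(I_\alpha)}_{G_\alpha(H)} - \lambda^{(I_\alpha)}_\varnothing} \quad \text{for } \alpha \in G,\ H \subseteq G. \tag{37}
\]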


Remark 2 A few remarks are in order:

1. Note that the dependence on $\tau$ is solely due to the term in square brackets in (35), whereas $f$ is
independent of time.

2. The same term implies that $P_0((G, m)) = 0$ for all $G \neq \varnothing$ and all tree-consistent mappings $m$.

3. The $g(T_\alpha(H))$ are well-defined and strictly positive since $G_\alpha(H) \neq \varnothing$ implies that $\lambda^{(I_\alpha)}_{G_\alpha(H)} > \lambda^{(I_\alpha)}_\varnothing$
(cf. Lemma 5 in von Wangenheim et al. 2010).

4. Eq. (35) implies a sum over all possible subtree decompositions of $T$; for every given decomposition,
the subtree containing $\gamma$ plays a special role.

5. We have implied here that the underlying set of links is $L$, i.e., $P_\tau(T) = P(\mathcal{F}_\tau^{(L)} = T)$. However,
due to Prop. 3, the result carries over if $L$ is replaced by any contiguous $L' \subseteq L$ that contains $G$.
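To make Theorem 4 concrete, the following sketch evaluates the alternating sum over $H$ and compares it with a direct summation over the time of the first cut. Everything here is an assumption-laden illustration: links are integers (key i for the link (2i - 1)/2), the values in `RHO` are invented, a topology is stored as a hypothetical parent map (the initial branching point maps to `None`), and $g(T_\alpha(H))$ is read as $\rho_\alpha$ divided by $\lambda^{(I_\alpha)}_{G_\alpha(H)} - \lambda^{(I_\alpha)}_\varnothing$.

```python
from fractions import Fraction
from itertools import combinations

# Assumed recombination probabilities; key i stands for the link (2i - 1)/2.
RHO = {1: Fraction(1, 10), 2: Fraction(1, 5), 3: Fraction(1, 4), 4: Fraction(1, 8)}

def lam(links, cut):
    """Probability that no segment of links minus cut is hit in one step."""
    result, mass = Fraction(1), Fraction(0)
    for x in links:
        if x in cut:
            result *= 1 - mass
            mass = Fraction(0)
        else:
            mass += RHO[x]
    return result * (1 - mass)

def path(parent, b):
    """Nodes from b up to the initial branching point, b included."""
    out = []
    while b is not None:
        out.append(b)
        b = parent[b]
    return out

def sub_nodes(parent, a, H):
    """G_a(H): descendants of a (a included) not cut off by an element of H above a."""
    nodes = set()
    for b in parent:
        p = path(parent, b)
        if a in p and not any(h in H for h in p[:p.index(a)]):
            nodes.add(b)
    return nodes

def segment(parent, a, links):
    """I_a: the links around a that lie between the nearest strict ancestors of a."""
    anc = path(parent, a)[1:]
    lo = max([h for h in anc if h < a], default=min(links) - 1)
    hi = min([h for h in anc if h > a], default=max(links) + 1)
    return [x for x in links if lo < x < hi]

def art_probability(parent, links, tau):
    """Alternating sum over all H that contain the initial branching point."""
    G = set(parent)
    gamma = next(a for a in G if parent[a] is None)
    others = sorted(G - {gamma})
    total = Fraction(0)
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            H = {gamma, *extra}
            f = Fraction(1)
            for a in G:
                seg = segment(parent, a, links)
                f *= RHO[a] / (lam(seg, sub_nodes(parent, a, H)) - lam(seg, set()))
            bracket = lam(links, sub_nodes(parent, gamma, H))**tau - lam(links, set())**tau
            total += (-1)**(len(H) - 1) * bracket * f
    return total

def direct(parent, gamma, links, tau):
    """Same probability by summing over the time of the first cut."""
    if gamma is None:                       # empty tree: nothing happens
        return lam(links, set())**tau
    kids = [b for b in parent if parent[b] == gamma]
    left = next((b for b in kids if b < gamma), None)
    right = next((b for b in kids if b > gamma), None)
    l_links = [x for x in links if x < gamma]
    r_links = [x for x in links if x > gamma]
    return sum(lam(links, set())**i * RHO[gamma]
               * direct(parent, left, l_links, tau - 1 - i)
               * direct(parent, right, r_links, tau - 1 - i)
               for i in range(tau))

LINKS = [1, 2, 3, 4]
TREE = {2: None, 1: 2, 4: 2, 3: 4}   # hypothetical topology: link 2 is cut first
print(all(art_probability(TREE, LINKS, t) == direct(TREE, 2, LINKS, t) for t in range(5)))
print(art_probability({2: None, 1: 2}, [1, 2, 3], 2))   # 3/200
```

The direct summation takes a number of operations that grows quickly with $\tau$, while the alternating sum has a cost that depends only on the size of the tree.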

Before we embark on the proof, let us briefly comment on the general strategy. Like every Markov 
chain, the segmentation process can be viewed in forward or in backward direction (with respect to 
the time increment on the r time scale): If the increment is at the end of the time interval, then the 
corresponding ultrametric tree grows at its top (i.e. the external branches are extended or split up); 
otherwise it grows at its base (i.e. the root branch is extended or the two corresponding subtrees 
coalesce). Whereas the original formulation (Definition 1) is in the bottom-up direction, the advantage
of the top-down approach is that one only has to deal with two objects in every step, namely the left 
and the right subtrees that emerge via the first segmentation event on the r timescale (and that are 
joined when looking back), instead of a possibly large number of smaller segments at the top. This 
point of view has already been used in the proof of Theorem [3] and will again serve in the following 
proof. 

Proof (of Theorem 4) We will prove the claim via induction in the top-down direction by progressively
merging pairs of subtrees. To do so, we first need some properties related to the corresponding tree
decomposition, see Figure 8. Consider a tree topology $T = (G, m)$, with $\varnothing \neq G \subseteq L$. The initial
branching point of $T$ is $\gamma \in G$ as before. If $\gamma$ has two (internal) offspring nodes, these are denoted by
$\gamma' \in G_{<\gamma} \subseteq L_{<\gamma}$ and $\gamma'' \in G_{>\gamma} \subseteq L_{>\gamma}$. We then define the left subtree $T'$ of $T$ as $T' = (G_{<\gamma}, m|_{G_{<\gamma}})$
and analogously the right subtree $T''$ as $T'' = (G_{>\gamma}, m|_{G_{>\gamma}})$, both obviously with fewer nodes than
$T$. For convenience, we will denote $L_{<\gamma}$ ($L_{>\gamma}$) by $L'$ ($L''$) and the respective nodes by $G' = G_{<\gamma}$
($G'' = G_{>\gamma}$). If $\gamma$ has no or only one offspring node, we take the empty tree (where nothing happens)
as left and/or right subtree. In any case, the offspring nodes of $\gamma$ are specified through the preimage
of $\gamma$ under the function $m$, i.e. $m^{-1}(\gamma) \in \{\varnothing, \{\gamma'\}, \{\gamma''\}, \{\gamma', \gamma''\}\}$. $T$ is then obtained by joining
these subtrees together at $\gamma$. In terms of the segmentation process, this corresponds to the very first
cut (of $L$, at link $\gamma$). Since this may happen at any time $j \in \{1, \ldots, \tau\}$ (i.e. the root branch lasts for
$i = j - 1 \in \{0, \ldots, \tau - 1\}$ time steps while $T'$ and $T''$ apply for the remaining $\tau - 1 - i$ time steps), it is
clear that

\[
P_\tau(T) = \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i}\, \rho_\gamma\, P_{\tau-1-i}(T')\, P_{\tau-1-i}(T''). \tag{38}
\]

Before we can evaluate (38), we need four preparatory results, (A) to (D).
Fig. 8 Joining together the left and the right subtrees ($T'$ and $T''$) at the initial branching point $\gamma$. Three different
cases arise, depending on whether one or both of the subtrees are empty trees.



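(A) Product rule for the $\lambda$'s

For all $L = L' \cup L'' \cup \{\gamma\}$ and all $G' \subseteq L'$, $G'' \subseteq L''$, we have

\[
\lambda^{(L')}_{G'}\, \lambda^{(L'')}_{G''} = \lambda^{(L)}_{G' \cup G'' \cup \{\gamma\}}, \tag{39}
\]

which follows immediately from (32).

(B) Product rule for $f$

For all $H \subseteq G$ with $H' := G' \cap H$ and $H'' := G'' \cap H$, one has

\begin{align*}
G_\alpha(H) &= G_\alpha(H') = G'_\alpha(H') \quad \text{for all } \alpha \in G' \text{ and} \\
G_\alpha(H) &= G_\alpha(H'') = G''_\alpha(H'') \quad \text{for all } \alpha \in G'', \tag{40}
\end{align*}

so that consequently

\begin{align*}
T_\alpha(H) &= T'_\alpha(H') \quad \text{for all } \alpha \in G' \text{ and} \\
T_\alpha(H) &= T''_\alpha(H'') \quad \text{for all } \alpha \in G''. \tag{41}
\end{align*}

This then leads to the following product rule for the function $f$ from (36):

\begin{align*}
f(T, H) &= g(T_\gamma(H)) \cdot \prod_{\alpha \in G'} g(T_\alpha(H)) \prod_{\beta \in G''} g(T_\beta(H)) \\
&= g(T_\gamma(H)) \cdot \prod_{\alpha \in G'} g(T'_\alpha(H')) \prod_{\beta \in G''} g(T''_\beta(H'')) = g(T_\gamma(H))\, f(T', H')\, f(T'', H''). \tag{42}
\end{align*}

Note that in case $G' = \varnothing$ or $G'' = \varnothing$, the corresponding empty product is 1 as usual.

(C) Assembly of initial branching point and subtrees

As an immediate consequence of Definition 3, one obtains for all $C \subseteq m^{-1}(\gamma) \subseteq H \subseteq G$:

\[
\{\gamma\} \cup \bigcup_{\alpha \in C} G_\alpha(H) = G_\gamma(H \setminus C) = G_\gamma((H \cup \{\gamma\}) \setminus C). \tag{43}
\]

(In fact, this relationship is not restricted to $\gamma$ but also holds for arbitrary $\alpha \in G$, in an analogous
way; but this will not be used in what follows.)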
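(D) Subtree summation

Recall the classic identity

\[
\sum_{i=0}^{n} a^i\, b^{n-i} =
\begin{cases}
\dfrac{b^{n+1} - a^{n+1}}{b - a}, & a \neq b, \\[2mm]
(n+1)\, a^n, & a = b,
\end{cases}
\tag{44}
\]

where the second case is also the l'Hôpital limit of the first. Together with (37), this implies

\[
\rho_\alpha \sum_{i=0}^{\tau-1} \big(\lambda^{(I_\alpha)}_\varnothing\big)^i \big(\lambda^{(I_\alpha)}_{G_\alpha(H)}\big)^{\tau-1-i}
= \rho_\alpha\, \frac{\big(\lambda^{(I_\alpha)}_{G_\alpha(H)}\big)^\tau - \big(\lambda^{(I_\alpha)}_\varnothing\big)^\tau}{\lambda^{(I_\alpha)}_{G_\alpha(H)} - \lambda^{(I_\alpha)}_\varnothing}
= g(T_\alpha(H)) \Big( \big(\lambda^{(I_\alpha)}_{G_\alpha(H)}\big)^\tau - \big(\lambda^{(I_\alpha)}_\varnothing\big)^\tau \Big) \tag{45}
\]

for all $T = (G, m)$, $\varnothing \neq G \subseteq L$, $\alpha \in G$ and $H \subseteq L$.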
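The identity (44) is the only analytic ingredient needed to contract branch lengths. A quick check with exact rational numbers, using hypothetical helper names `lhs` and `rhs` and arbitrarily chosen values, looks as follows.

```python
from fractions import Fraction

def lhs(a, b, n):
    """Left-hand side of (44): sum of a^i b^(n-i) for i = 0, ..., n."""
    return sum(a**i * b**(n - i) for i in range(n + 1))

def rhs(a, b, n):
    """Right-hand side of (44), with the degenerate case a = b treated separately."""
    if a == b:
        return (n + 1) * a**n
    return (b**(n + 1) - a**(n + 1)) / (b - a)

a, b = Fraction(9, 20), Fraction(3, 4)   # arbitrary test values
print(all(lhs(a, b, n) == rhs(a, b, n) for n in range(10)))   # True
print(all(lhs(a, a, n) == rhs(a, a, n) for n in range(10)))   # True
```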



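We now continue the proof of the theorem and proceed via induction over $|G|$. For $G = \varnothing$, the
claim holds trivially, that is, $P_\tau(T) = \lambda_\varnothing^{\,\tau}$ for all $\tau \geq 0$. For $G \subseteq L$ with $|G| = 1$, i.e. $G = \{\gamma\}$, both
the left and the right subtrees are empty trees (see the left case in Figure 8). Using first (38), then
the result for $G' = G'' = \varnothing$ on $L'$ and $L''$, respectively (see Remark 2 (5)), then (39), and finally
(45), we obtain

\begin{align*}
P_\tau(T) &= \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i}\, \rho_\gamma\, P_{\tau-1-i}(T')\, P_{\tau-1-i}(T'')
= \rho_\gamma \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i}\, \big(\lambda^{(L')}_\varnothing\big)^{\tau-1-i} \big(\lambda^{(L'')}_\varnothing\big)^{\tau-1-i} \\
&= \rho_\gamma \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i}\, \lambda_\gamma^{\,\tau-1-i} = g(T_\gamma(\{\gamma\}))\, \big(\lambda_\gamma^{\,\tau} - \lambda_\varnothing^{\,\tau}\big).
\end{align*}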



We now assume the claim to hold for all tree topologies $T = (G, m)$ for all $G \subseteq L$ with $|G| \leq k$ for
some $k \geq 1$; by Remark 2 (5) it then holds likewise with $L$ replaced by a contiguous $L' \subseteq L$ as long
as $G \subseteq L'$. We next turn to $G = \{\alpha_1, \ldots, \alpha_{k+1}\} \subseteq L$ and fixed $T = (G, m)$; recall our convention
$\alpha_1 < \alpha_2 < \ldots < \alpha_{k+1}$. In the following, we will always write $A \cup \alpha := A \cup \{\alpha\}$ and $A \setminus \alpha := A \setminus \{\alpha\}$
for $A \subseteq L$ and $\alpha \in L$. We have to distinguish the two remaining cases in Figure 8 (middle and right):
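Case $\gamma = \alpha_1$

Then $T'$ is an empty tree while $T''$ has initial branching point $\gamma'' \in G''$, i.e. $m^{-1}(\gamma) = \{\gamma''\}$. We then
find

\begin{align*}
P_\tau(T) &= \rho_\gamma \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i}\, P_{\tau-1-i}(T')\, P_{\tau-1-i}(T'') \\
&= \rho_\gamma \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i}\, \big(\lambda^{(L')}_\varnothing\big)^{\tau-1-i} \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H''|-1} \Big( \big(\lambda^{(L'')}_{G''_{\gamma''}(H'')}\big)^{\tau-1-i} - \big(\lambda^{(L'')}_\varnothing\big)^{\tau-1-i} \Big) f(T'', H'') \\
&= \rho_\gamma \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i} \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H''|-1} \Big( \lambda_{\gamma \cup G''_{\gamma''}(H'')}^{\,\tau-1-i} - \lambda_\gamma^{\,\tau-1-i} \Big) f(T'', H'') \\
&= \rho_\gamma \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i} \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H''|-1} \Big( \lambda_{G_\gamma((H'' \cup \gamma) \setminus \gamma'')}^{\,\tau-1-i} - \lambda_{G_\gamma(H'' \cup \gamma)}^{\,\tau-1-i} \Big) f(T'', H'') \\
&= \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H''|-1}\, g\big(T_\gamma((H'' \cup \gamma) \setminus \gamma'')\big) \Big( \lambda_{G_\gamma((H'' \cup \gamma) \setminus \gamma'')}^{\,\tau} - \lambda_\varnothing^{\,\tau} \Big) f(T'', H'') \\
&\quad - \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H''|-1}\, g\big(T_\gamma(H'' \cup \gamma)\big) \Big( \lambda_{G_\gamma(H'' \cup \gamma)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \Big) f(T'', H'') \\
&= \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H''|-1} \Big( \lambda_{G_\gamma((H'' \cup \gamma) \setminus \gamma'')}^{\,\tau} - \lambda_\varnothing^{\,\tau} \Big) f\big(T, (H'' \cup \gamma) \setminus \gamma''\big) \\
&\quad - \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H''|-1} \Big( \lambda_{G_\gamma(H'' \cup \gamma)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \Big) f(T, H'' \cup \gamma) \\
&= \sum_{\gamma \in H \subseteq G \setminus \gamma''} (-1)^{|H|-1} \Big( \lambda_{G_\gamma(H)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \Big) f(T, H) - \sum_{\{\gamma, \gamma''\} \subseteq H \subseteq G} (-1)^{|H|} \Big( \lambda_{G_\gamma(H)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \Big) f(T, H) \\
&= \sum_{\gamma \in H \subseteq G} (-1)^{|H|-1} \Big( \lambda_{G_\gamma(H)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \Big) f(T, H).
\end{align*}

In the first step, we have used (38), in the second the induction hypothesis (applied to $T''$ on $L''$), in
the third the product structure of the $\lambda$'s (39), in the fourth (40) (read in the backward direction)
and (43) (applied to $H = H''$ with $C = \{\gamma''\}$ and $C = \varnothing$, respectively). In the fifth step, we have
invoked (45) (separately on each term in parentheses), in the sixth step we have used (42) with
$H = (H'' \cup \gamma) \setminus \gamma''$ and $H = H'' \cup \gamma$, respectively, and finally we have changed the summation variable
(where we set $G = \gamma \cup G''$).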



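Case $\gamma \in \{\alpha_2, \ldots, \alpha_k\}$

Now we have to consider the left subtree $T'$ with initial branching point $\gamma'$ and the right subtree
$T''$ with initial branching point $\gamma''$, i.e. $m^{-1}(\gamma) = \{\gamma', \gamma''\}$. Proceeding in analogy with the previous
case, we obtain

\begin{align*}
P_\tau(T) &= \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i}\, \rho_\gamma\, P_{\tau-1-i}(T')\, P_{\tau-1-i}(T'') \\
&= \rho_\gamma \sum_{i=0}^{\tau-1} \lambda_\varnothing^{\,i} \sum_{\gamma' \in H' \subseteq G'}\ \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H'|+|H''|} \Big( \lambda_{G_\gamma((H' \cup H'' \cup \gamma) \setminus \{\gamma', \gamma''\})}^{\,\tau-1-i} \\
&\qquad - \lambda_{G_\gamma((H' \cup H'' \cup \gamma) \setminus \gamma'')}^{\,\tau-1-i} - \lambda_{G_\gamma((H' \cup H'' \cup \gamma) \setminus \gamma')}^{\,\tau-1-i} + \lambda_{G_\gamma(H' \cup H'' \cup \gamma)}^{\,\tau-1-i} \Big) f(T', H')\, f(T'', H'') \\
&= \sum_{\gamma' \in H' \subseteq G'}\ \sum_{\gamma'' \in H'' \subseteq G''} (-1)^{|H'|+|H''|} \Big( g\big(T_\gamma((H' \cup H'' \cup \gamma) \setminus \{\gamma', \gamma''\})\big) \big( \lambda_{G_\gamma((H' \cup H'' \cup \gamma) \setminus \{\gamma', \gamma''\})}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) \\
&\qquad - g\big(T_\gamma((H' \cup H'' \cup \gamma) \setminus \{\gamma''\})\big) \big( \lambda_{G_\gamma((H' \cup H'' \cup \gamma) \setminus \{\gamma''\})}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) \\
&\qquad - g\big(T_\gamma((H' \cup H'' \cup \gamma) \setminus \{\gamma'\})\big) \big( \lambda_{G_\gamma((H' \cup H'' \cup \gamma) \setminus \{\gamma'\})}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) \\
&\qquad + g\big(T_\gamma(H' \cup H'' \cup \gamma)\big) \big( \lambda_{G_\gamma(H' \cup H'' \cup \gamma)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) \Big) f(T', H')\, f(T'', H'') \\
&= \sum_{\gamma \in H \subseteq G \setminus \{\gamma', \gamma''\}} (-1)^{|H|-1} \big( \lambda_{G_\gamma(H)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) f(T, H) - \sum_{\{\gamma, \gamma'\} \subseteq H \subseteq G \setminus \{\gamma''\}} (-1)^{|H|} \big( \lambda_{G_\gamma(H)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) f(T, H) \\
&\quad - \sum_{\{\gamma, \gamma''\} \subseteq H \subseteq G \setminus \{\gamma'\}} (-1)^{|H|} \big( \lambda_{G_\gamma(H)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) f(T, H) + \sum_{\{\gamma, \gamma', \gamma''\} \subseteq H \subseteq G} (-1)^{|H|-1} \big( \lambda_{G_\gamma(H)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) f(T, H) \\
&= \sum_{\gamma \in H \subseteq G} (-1)^{|H|-1} \big( \lambda_{G_\gamma(H)}^{\,\tau} - \lambda_\varnothing^{\,\tau} \big) f(T, H).
\end{align*}

The remaining case $\gamma = \alpha_{k+1}$ is analogous to $\gamma = \alpha_1$. □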

Now that we have an explicit formula for the probability of our tree topologies, let us further comment
on its structure.

Remark 3 Using the subtree decomposition of Def. 3, we can state (36) alternatively as

\[
f(T, H) = \prod_{\alpha \in H}\ \prod_{\beta \in G_\alpha(H)} g(T_\beta(H)), \tag{46}
\]

which implies some kind of independence across subtrees.

Returning to the segmentation process, we may conclude that

\[
P(F_\tau = G) = \sum_{m} P(\mathcal{F}_\tau = (G, m)),
\]

where the sum is over all mappings consistent with a full binary tree. The final result then follows
directly from Theorem 3.

Corollary 1 (Solution of recombination equation via ARTs) The discrete-time recombination
equation (5) has the solution

\[
p_t = \sum_{G \subseteq L} a_G(t)\, R_G(p_0),
\]

where

\[
a_G(t) = \sum_{m} P(\mathcal{F}_t = (G, m)) \tag{47}
\]

for all $G \subseteq L$, where the sum is over all tree-consistent $m$, with $P(\mathcal{F}_t = (G, m)) = P_t(T)$ as given in
Theorem 4. □



4 Discussion 



The piece of research presented here has solved the long-standing deterministic single-crossover 
dynamics forward in time by considering the corresponding stochastic process (the Wright-Fisher 
model with recombination) backward in time, via looking at single individuals and tracing back their 
ancestries. We will first compare the result with the previous recursive approaches and then turn to 
the connection with the ancestral recombination graph (ARG; the usual approach to recombination
in finite populations, see Wakeley 2008, Chap. 7.2 and Durrett 2008, Chap. 3.4).

Evaluating the coefficients $a_G(t)$ via the traditional recursive approaches is an algebraic strategy,
which relies on linearisation and diagonalisation of the underlying (forward, deterministic) dynamical 
system. In contrast, the ART approach presented here starts from a summation over all paths of the 
(backward, stochastic) process that give rise to a given set of segments after t generations; this is 
reflected in the sum over all ultrametric trees, as in (33). This formula contains sums over all tree
topologies and over all combinations of branch lengths, which is not useful in itself, in particular for
large t. The simplification obtained here consists in carrying out the summation over the branch 
lengths, so that one is left with the tree topologies only. This is particularly useful for large t and 
small recombination probabilities, since long branches (where nothing happens) are 'contracted' in 
this way. 

Both the recursive and the ART solution are of similar computational complexity. Whether $a_G(t)$ is
evaluated via ARTs (Corollary 1 and Theorem 4) or by solving the recursions (e.g. von Wangenheim et al.
2010, Eqs. (43)-(45), or the related ones of Dawson 2000, 2002), the effort grows exponentially with
$n := |G|$. In the recursions, the complexity comes from multiple sums over nested sets of subsets of
$G$. In the case of the ARTs, it is due to the summation over all possible tree topologies with internal
node set $G$; their number grows exponentially with $n$. To be precise, there are $C(n)$ such topologies
(Gross and Yellen 1999, Chap. 3.4; Stanley 1999, Ex. 6.19.d), where $C(n)$ is the $n$'th Catalan
number. After all, clever algorithms are available for the generation and enumeration of these trees
(Gupta 1991; Proskurowski 1980; Zaks 1980).
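The count of tree topologies can be reproduced by a short recursive enumeration. This is a plain sketch, not one of the specialised algorithms cited above: a topology on an ordered node set is built by choosing the initial branching point and recursing on the nodes to its left and to its right; the parent-map representation and the function names are assumptions of this example.

```python
from math import comb

def topologies(nodes):
    """Yield parent maps of all tree-consistent topologies on the sorted tuple `nodes`."""
    if not nodes:
        yield {}
        return
    for k, gamma in enumerate(nodes):          # gamma is the initial branching point
        for left in topologies(nodes[:k]):
            for right in topologies(nodes[k + 1:]):
                tree = {gamma: None}
                for sub in (left, right):
                    for node, par in sub.items():
                        # initial branching points of the subtrees hang below gamma
                        tree[node] = gamma if par is None else par
                yield tree

def catalan(n):
    """n'th Catalan number."""
    return comb(2 * n, n) // (n + 1)

counts = [sum(1 for _ in topologies(tuple(range(n)))) for n in range(7)]
print(counts)                                   # [1, 1, 2, 5, 14, 42, 132]
print(counts == [catalan(n) for n in range(7)]) # True
```

Summing the tree probabilities over the output of such an enumeration is what the corollary requires, so the exponential growth of the Catalan numbers directly bounds the cost of the ART formula.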

The ART formula is therefore not superior in computational terms. However, it has the great
advantage of relating to objects with an immediate meaning in terms of the underlying process.
Moreover, the probabilities for the tree topologies may lend themselves to future use if, for example, one
is interested in the distribution of tree shape(s), or if mutation is included in the model (that is,
superimposed on the trees). This is in contrast with the manifestly non-intuitive recursions, for which
we are not aware of an interpretation in terms of the underlying process.

The backward approach that gives rise to the ARTs differs from the ARG in two ways. First, we 
let N tend to infinity without rescaling any parameters; that is, recombination probabilities remain 
constant when N → ∞. This corresponds to the assumption that recombination is so strong (that
is, loci are so far apart) that the majority of the recombination events takes place before coalescence 
sets in. In contrast, the ARG assumes weak recombination (in that recombination parameters scale 
inversely with population size) , so that recombination and coalescence take place on the same time 
scale. Second, we focus on the ancestry of single individuals rather than of samples, which further 
simplifies matters. As a reward, one obtains semi-explicit answers for all quantities of interest here. 

The differences between the scopes of the two approaches can be illustrated nicely in the context
of the paper by Wiuf and Hein (1997), who analyse the ancestry of the genetic material of an entire
chromosome from a single individual. More precisely, they investigate the partitioning process (a 
close relative of our {Σ_t}_{t≥0}) at stationarity, i.e., for τ → ∞, in the diffusion limit. They employ
approximations and simulations and focus on the concrete example of the human chromosome 1, with 
realistic estimates of the recombination parameters and the effective population size. They do find 
large contributions from unordered partitions, in the sense of the frequent occurrence of segments 
of ancestral material interspersed with nonancestral ('trapped') material between them. This is a 
consequence of the large number of genetic ancestors of the chromosome (estimated at 6800) relative 
to the effective population size (N_e = 20000). (For loci at opposite ends of the chromosome, the
diffusion limit is, however, not expected to yield a good approximation, see below.) 

In contrast, our approach does not aim at any stationary situation. Rather, it should provide 
a faithful picture in situations where recombination rates and population size are large enough so 
that there is a time horizon governed by recombination alone. The detailed time course of individual 
ancestries over this time horizon will then be described by the segmentation process. Let us note 
that a well-known method of simulating the ARG (the so-called sequential Markov coalescent by
McVean and Cardin 2005) also uses an approximation of the partitioning process by the segmentation
process, in that coalescence of sequences that both carry ancestral material is neglected; the authors
demonstrate that this yields a good approximation to the full ARG over a wide range of parameters.

To be more precise, we stipulate that there are, in fact, three scaling regimes to be considered 
in the context of recombination: weak recombination, strong recombination, and free recombination. 
We can certainly not delineate them precisely here (and they are also expected to overlap). However, 
with the usual assumption of a recombination probability of the order of 10^-8 per generation and
base pair of a DNA sequence (Kauppi et al. 2004), it is clear that sites at a distance of the order
of 10^3 to 10^4 base pairs will fall into the weak recombination regime: the recombination probability
between them, of 10^-5 to 10^-4, is of the same order as the coalescence probability of 1/N (between
any pair of branches per generation). In contrast, sites at a distance of 10^8 base pairs (like the
opposing ends of a chromosome) will be essentially independent since there will be, on average, 
one crossover between them in every generation; this is the case of free recombination. Between the 
extremes, at a distance of 10^6 base pairs, say, the recombination probability of 10^-2 between a pair
of sites is well separated from the coalescence probability of 1/N per pair of branches for all but
the smallest populations; this should be a case for strong recombination over a time horizon where 
the number of branches is not too large. Note that the 'number of branches' that counts here is 
the number of genetic ancestors of an individual (that is, those that carry ancestral material), not 
the number of genealogical ancestors (that is, all parents, grandparents and so on). The number 
of genetic ancestors only increases roughly linearly with (backward) time τ, whereas an individual
has up to 2^τ genealogical ancestors τ generations back; see the discussions by Donnelly (1983) and
Ralph and Coop (2013). Note also that, in this parameter regime, the assumption of at most one
crossover per generation is well justified, while recombination is still strong relative to coalescence.
Last but not least, it is worth mentioning that this is the parameter regime where the deterministic
dynamics yields a valid description.
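The order-of-magnitude reasoning above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming the recombination probability between two sites is simply the per-base-pair value of 10^-8 times their distance (a linear approximation that is reasonable only while the product stays well below 1) and using N = 20000 purely as an illustrative population size; all function names are hypothetical.

```python
def expected_crossovers(distance_bp: float, per_bp_probability: float = 1e-8) -> float:
    """Expected number of crossovers per generation between two sites.

    Assumption: linear in the distance; for small values this is also the
    recombination probability between the two sites.
    """
    return distance_bp * per_bp_probability


def recombination_to_coalescence_ratio(
    distance_bp: float, population_size: int, per_bp_probability: float = 1e-8
) -> float:
    """Ratio of the recombination probability to the pairwise coalescence probability 1/N."""
    coalescence_probability = 1.0 / population_size
    return expected_crossovers(distance_bp, per_bp_probability) / coalescence_probability


def max_genealogical_ancestors(generations_back: int) -> int:
    """Upper bound on the number of genealogical ancestors: 2 to the power of the generation count."""
    return 2 ** generations_back


if __name__ == "__main__":
    N = 20_000  # illustrative population size, an assumption of this sketch
    for distance in (1e3, 1e4, 1e6, 1e8):
        print(
            f"distance {distance:.0e} bp: "
            f"recombination {expected_crossovers(distance):.0e}, "
            f"ratio to 1/N {recombination_to_coalescence_ratio(distance, N):.1e}"
        )
    print("genealogical ancestors 20 generations back, at most:", max_genealogical_ancestors(20))
```

A ratio of order 1 corresponds to the weak recombination regime, a ratio much larger than 1 with a recombination probability still well below 1 to strong recombination, and an expected crossover number of order 1 to free recombination; where exactly the boundaries lie is not fixed by the text.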

The results presented here should also pave the way towards a solution of the multiple crossover 
model. Biologically, this is relevant when more distant loci are considered. Recombination in a given 
generation will again be described by a partition of S into two parts (corresponding to the two 
parents), but, this time, these partitions may be arbitrary (as opposed to the ordered partitions that 
arise due to single crossovers). But we expect that the corresponding ancestral recombination trees 
will continue to be binary trees whose subtrees are conditionally independent, so that the methods 
developed here may be generalised to this case involving arbitrary partitions. 